For beginners: Python Regex HOWTO and Python Regex Syntax.
For when the details matter: PCRE Syntax (what jiten actually uses).
Prefix "Commands"
Queries support prefix "commands" (unrelated to regex syntax): e.g.
+w foo (word) should give the same results as
\bfoo\b,
+1 foo (1st word) as
^foo\b, and
+= foo (exact) as
^foo$. (The
+ prefix was chosen because
no valid regex can start with it: a quantifier needs something before it to repeat.)
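As a rough illustration of the equivalences above, the following Python sketch rewrites a prefixed query into the corresponding regex. The names PREFIXES, expand_prefix, and is_valid_regex are hypothetical and not taken from jiten, and the assumption that the command is separated from the pattern by a single space is based only on the examples shown here. It uses Python's re module rather than PCRE.

```python
import re

# Hypothetical table: prefix command -> (text put before, text put after).
PREFIXES = {
    "+w": (r"\b", r"\b"),  # word
    "+1": ("^", r"\b"),    # 1st word
    "+=": ("^", "$"),      # exact
}

def expand_prefix(query):
    """Rewrite e.g. '+w foo' as r'\\bfoo\\b'; return other queries unchanged."""
    head, _, rest = query.partition(" ")
    if head not in PREFIXES or not rest:
        return query
    before, after = PREFIXES[head]
    return before + rest + after

def is_valid_regex(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(expand_prefix("+w foo"))
print(is_valid_regex("+w foo"))  # False: a leading + has nothing to repeat
```

This naive concatenation only mirrors the three equivalences listed above; a pattern with a top-level | would need to be wrapped in a group first, and how jiten handles that case is not stated here.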
Quick Reference
-
Most letters and characters will simply match themselves; e.g.
foo matches "foo".
-
The metacharacters
. ^ $ * + ? { } [ ] \ | ( ) have special
meanings (described below) instead of matching themselves; to match one
literally, escape it with a backslash; e.g. \\ matches
an actual backslash.
-
. matches any character (except newline); e.g.
ba. matches "bar", "baz", etc.
-
^ matches at the start of a line; e.g.
^foo matches "foo" at the start of a line.
-
$ matches at the end of a line; e.g.
foo$ matches "foo" at the end of a line.
-
\b matches at a "word boundary" (the start or end of
a "word"); e.g. \bbar\b matches "bar" in "foo bar
baz", but not in "foobarbaz"; \B is its complement.
-
* matches the preceding thing zero or more times;
e.g. fo* matches "f", "fo", "foo", etc.
-
+ matches the preceding thing one or more times; e.g.
fo+ matches "fo", "foo", etc.
-
? matches the preceding thing zero or one time (i.e. makes it optional); e.g.
fo? matches "f" or "fo".
-
{m,n} (or just {m} instead of
{m,m}) matches the preceding thing m to
n times; e.g. fo{2,4} matches "foo",
"fooo", or "foooo".
-
[...] is a character class, which matches any one of the listed characters; e.g. [a-z]
matches any one letter from "a" through "z"; [あいうえお] matches "あ",
"い", "う", "え", or "お".
-
[^...] is a complemented character class; e.g.
[^a-z] matches any one character other than "a" through "z".
-
\d matches any decimal digit (equivalent to
[0-9]); \D is its complement (equivalent
to [^0-9]).
-
\s matches any whitespace character; \S
is its complement.
-
\w matches any "word" character (a letter, a digit, or an
underscore); \W is its complement.
-
| is alternation; e.g. foo|bar|baz
matches "foo", "bar", or "baz".
-
(...) is grouping; e.g. ab* matches "a",
"ab", "abb", etc. whereas (ab)* matches "", "ab",
"abab", etc. Likewise, ^foo|bar$ matches "foo" at the
beginning of a line or "bar" at the end, whereas
^(foo|bar)$ matches either "foo" or "bar" as a whole
line. A backslash followed by the number of a group (counting
from 1) can be used later in the pattern as a backreference to the text that
group actually matched; e.g. [a-z]{2} matches "aa",
"ab", "za", etc. whereas ([a-z])\1 matches "aa",
"bb", etc. (but not "ab" or "za").
-
\p{...} matches a Unicode property; e.g.
\p{Han} matches kanji, \p{Hiragana}
matches hiragana, and \p{Katakana} matches katakana;
\P{...} is its complement.
-
For easy matching of Japanese, jiten supports these non-standard
aliases:
\pK for \p{Han},
\ph for \p{Hiragana}, and
\pk for \p{Katakana}.
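Most entries in the list above can be tried out with Python's standard re module, which agrees with PCRE on these basics. The sketch below is only a spot check: the CASES table and the failing_cases and has_word helpers are made up for illustration, and \p{...} (along with jiten's aliases) is left out because re does not support it.

```python
import re

# (pattern, text it should match as a whole, text it should not)
CASES = [
    (r"ba.", "baz", "ba"),
    (r"fo*", "f", "o"),
    (r"fo+", "foo", "f"),
    (r"fo?", "fo", "foo"),
    (r"fo{2,4}", "fooo", "fo"),
    (r"[a-z]", "q", "Q"),
    (r"[^a-z]", "Q", "q"),
    (r"\d", "7", "x"),
    (r"foo|bar|baz", "bar", "qux"),
    (r"(ab)*", "abab", "aba"),
    (r"([a-z])\1", "aa", "ab"),
    (r"\\", "\\", "/"),
]

def failing_cases(cases):
    """Return the patterns that do not behave as the table above expects."""
    failures = []
    for pattern, good, bad in cases:
        if not re.fullmatch(pattern, good) or re.fullmatch(pattern, bad):
            failures.append(pattern)
    return failures

def has_word(word, text):
    """Search for a literal word delimited by \\b on both sides."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

print(failing_cases(CASES))            # [] if every case behaves as described
print(has_word("bar", "foo bar baz"))  # True
print(has_word("bar", "foobarbaz"))    # False
```

re.fullmatch is used so that each pattern has to account for the entire test string; jiten's own queries presumably search within entries instead, which is why anchors such as ^, $, and \b matter there.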
Examples
-
+w cat (\bcat\b) matches "cat" in "the
cat" (but not in e.g. "indicates").
-
+1 cat (^cat\b) matches "cat" in "cat"
or "cat (esp. the domestic cat, Felis catus)" (but not in e.g.
"category").
-
+= cat (^cat$) matches "cat" exactly.
-
+= 猫\pK (^猫\pK$) matches "猫" followed
by exactly one kanji (e.g. "猫背").
-
+= (\pK)\1 (^(\pK)\1$) matches a kanji
immediately followed by the same kanji, e.g. "人人".