For beginners: Python Regex HOWTO & Python Regex Syntax.
For when the details matter: PCRE Syntax (what jiten actually uses).
Prefix "Commands"
Queries support prefix "commands" (unrelated to regex syntax): e.g. +w foo (word) should give the same results as \bfoo\b, +1 foo (1st word) as ^foo\b, and += foo (exact) as ^foo$. (The + prefix was chosen because a quantifier needs something before it to repeat, so no valid regex can start with it.)
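The equivalences above can be sketched as a small rewriting step. This is only an illustration under assumptions: the names `PREFIXES` and `expand_prefix` are hypothetical and not taken from jiten's source, and Python's `re` is used here merely to show that a pattern starting with + is rejected.

```python
import re

# Hypothetical table: prefix command -> (text put before, text put after),
# following the equivalences described above.
PREFIXES = {
    "+w": (r"\b", r"\b"),  # word
    "+1": (r"^", r"\b"),   # 1st word
    "+=": (r"^", r"$"),    # exact
}

def expand_prefix(query: str) -> str:
    """Rewrite a query with a prefix command into a plain regex (sketch)."""
    for command, (before, after) in PREFIXES.items():
        if query.startswith(command + " "):
            body = query[len(command) + 1:].strip()
            return before + body + after
    return query  # no prefix command: use the query as-is

for query in ("+w foo", "+1 foo", "+= foo", "foo"):
    print(query, "->", expand_prefix(query))

# A leading + has nothing to repeat, so it cannot be mistaken for a regex.
try:
    re.compile("+w foo")
except re.error as exc:
    print("not a valid regex:", exc)
```

Because the prefixes are checked before the query is treated as a regex, an ordinary pattern such as foo passes through unchanged.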
Quick Reference
- Most letters and characters will simply match themselves; e.g. foo matches "foo".
- The metacharacters . ^ $ * + ? { } [ ] \ | ( ) allow matching using sophisticated patterns and rules; to match one of them literally, escape it with a preceding backslash; e.g. \\ matches an actual backslash.
- . matches any character (except newline); e.g. ba. matches "bar", "baz", etc.
- ^ matches at the start of a line; e.g. ^foo matches "foo" at the start of a line.
- $ matches at the end of a line; e.g. foo$ matches "foo" at the end of a line.
- \b matches at a "word boundary" (the start or end of a "word"); e.g. \bbar\b matches "bar" in "foo bar baz", but not in "foobarbaz"; \B is its complement.
- * matches the preceding thing zero or more times; e.g. fo* matches "f", "fo", "foo", etc.
- + matches the preceding thing one or more times; e.g. fo+ matches "fo", "foo", etc.
- ? matches the preceding thing optionally; e.g. fo? matches "f" or "fo".
- {m,n} (or just {m} instead of {m,m}) matches the preceding thing m to n times; e.g. fo{2,4} matches "foo", "fooo", or "foooo".
- [...] is a character class; e.g. [a-z] matches "a" through "z"; [あいうえお] matches "あ", "い", "う", "え", or "お".
- [^...] is a complementing character class; e.g. [^a-z] matches anything but "a" through "z".
- \d matches any decimal digit (equivalent to [0-9]); \D is its complement (equivalent to [^0-9]).
- \s matches any whitespace character; \S is its complement.
- \w matches any "word" character (a letter, a digit, or the underscore); \W is its complement.
- | is alternation; e.g. foo|bar|baz matches "foo", "bar", or "baz".
- (...) is grouping; e.g. ab* matches "a", "ab", "abb", etc. whereas (ab)* matches "", "ab", "abab", etc. Likewise, ^foo|bar$ matches "foo" at the beginning of a line or "bar" at the end, whereas ^(foo|bar)$ matches either "foo" or "bar" as a whole line. A backslash followed by the number of a group (starting from 1) can be used later in the pattern to refer back to what that group (actually) matched; e.g. [a-z]{2} matches "aa", "ab", "za", etc. whereas ([a-z])\1 matches "aa", "bb", etc. (but not "ab" or "za").
- \p{...} matches a Unicode property; e.g. \p{Han} matches kanji, \p{Hiragana} matches hiragana, and \p{Katakana} matches katakana; \P{...} is its complement.
- For easy matching of Japanese, jiten supports these non-standard aliases: \pK for \p{Han}, \ph for \p{Hiragana}, and \pk for \p{Katakana}.
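Several of the claims in the quick reference above can be checked with a short script. This is a sketch that uses Python's standard `re` module rather than the PCRE engine jiten uses; the two should agree on these basic constructs, but `re` does not support \p{...} or the jiten aliases, so those are left out. The names `CASES` and `found` are made up for this illustration.

```python
import re

# (pattern, text, expected) triples based on the quick reference above.
CASES = [
    (r"\\", "a\\b", True),               # escaped backslash
    (r"ba.", "baz", True),               # . matches any character
    (r"\bbar\b", "foo bar baz", True),   # word boundaries
    (r"\bbar\b", "foobarbaz", False),
    (r"^fo{2,4}$", "foooo", True),       # {m,n} repetition
    (r"^fo{2,4}$", "fo", False),
    (r"^[あいうえお]$", "う", True),      # character class
    (r"[^a-z]", "abc", False),           # complementing character class
    (r"^\d+$", "2024", True),            # decimal digits
    (r"^(ab)*$", "abab", True),          # grouping changes what * repeats
    (r"^ab*$", "abab", False),
    (r"^(foo|bar)$", "bar", True),       # alternation inside a group
    (r"^([a-z])\1$", "aa", True),        # backreference to group 1
    (r"^([a-z])\1$", "ab", False),
]

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern, text, expected in CASES:
    result = found(pattern, text)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{pattern!r:20} on {text!r:15} -> {result} ({status})")
```

Anchoring the test patterns with ^ and $ makes each check apply to the whole string, which is what distinguishes, for example, ab* from (ab)*.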
Examples
- +w cat (\bcat\b) matches "cat" in "the cat" (but not in e.g. "indicates").
- +1 cat (^cat\b) matches "cat" in "cat" or "cat (esp. the domestic cat, Felis catus)" (but not in e.g. "category").
- += cat (^cat$) matches "cat" exactly.
- += 猫\pK (^猫\pK$) matches "猫" followed by exactly one other kanji.
- += (\pK)\1 (^(\pK)\1$) matches the same kanji twice in a row, e.g. "人人".
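The examples above can be tried outside jiten with a rough stand-in for the aliases. This is a sketch under loud assumptions: Python's `re` has no \p{...}, so the hypothetical `APPROX` table replaces each alias with a single Unicode block that covers only the most common characters of that script, and the plain string replacement does not handle an escaped backslash in front of an alias. The names `APPROX`, `approximate`, and `is_match` are invented for this illustration and are not jiten's implementation.

```python
import re

# Assumption: single-block approximations of the script properties.
# The real \p{Han}, \p{Hiragana}, and \p{Katakana} cover more code points.
APPROX = {
    r"\pK": "[\u4e00-\u9fff]",  # CJK Unified Ideographs (common kanji)
    r"\ph": "[\u3041-\u3096]",  # hiragana letters
    r"\pk": "[\u30a1-\u30fa]",  # katakana letters
}

def approximate(pattern: str) -> str:
    """Replace each jiten alias with its approximate character class."""
    for alias, char_class in APPROX.items():
        pattern = pattern.replace(alias, char_class)
    return pattern

def is_match(pattern: str, text: str) -> bool:
    """Return True if the (approximated) pattern matches anywhere in text."""
    return re.search(approximate(pattern), text) is not None

print(is_match(r"\bcat\b", "the cat"))      # word: True
print(is_match(r"\bcat\b", "indicates"))    # False
print(is_match(r"^cat\b", "category"))      # 1st word: False
print(is_match(r"^猫\pK$", "猫舌"))          # 猫 plus one kanji: True
print(is_match(r"^猫\pK$", "猫"))            # no second kanji: False
print(is_match(r"^(\pK)\1$", "人人"))        # same kanji twice: True
print(is_match(r"^(\pK)\1$", "人口"))        # two different kanji: False
```

With a regex engine that supports Unicode properties, the aliases could instead be expanded to \p{Han}, \p{Hiragana}, and \p{Katakana} as described in the quick reference.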