Saturday, November 12, 2011

Regular expressions simplify pattern-matching code - 8


Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides such shortcuts with predefined character classes, which Table 1 presents. Use predefined character classes to simplify your regexes and minimize regex syntax errors.

Table 1. Predefined character classes

Predefined character class    Description
\d                            A digit. Equivalent to [0-9].
\D                            A nondigit. Equivalent to [^0-9].
\s                            A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S                            A nonwhitespace character. Equivalent to [^\s].
\w                            A word character. Equivalent to [a-zA-Z_0-9].
\W                            A nonword character. Equivalent to [^\w].
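
The equivalences in Table 1 can be checked directly. The following is a minimal sketch, not part of the original article: the class name PredefinedClassCheck and the method sameOnAscii are hypothetical, and the check is limited to the 128 ASCII characters under Pattern's default flags.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for illustration.
class PredefinedClassCheck {
    // Returns true if both regexes agree on every single ASCII character (0-127).
    static boolean sameOnAscii(String shortcut, String expanded) {
        Pattern a = Pattern.compile(shortcut);
        Pattern b = Pattern.compile(expanded);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // In Java source, each regex backslash is written as two backslashes.
        System.out.println(sameOnAscii("\\d", "[0-9]"));
        System.out.println(sameOnAscii("\\D", "[^0-9]"));
        System.out.println(sameOnAscii("\\s", "[ \\t\\n\\x0B\\f\\r]"));
        System.out.println(sameOnAscii("\\S", "[^\\s]"));
        System.out.println(sameOnAscii("\\w", "[a-zA-Z_0-9]"));
        System.out.println(sameOnAscii("\\W", "[^\\w]"));
    }
}
```

Each call should print true when run with the default flags.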

The following command-line example uses the \w predefined character class to identify all word characters in its text command-line argument:
java RegexDemo \w "aZ.8 _"

The command line above produces the following output, which shows that the period and space characters are not considered word characters:
Regex = \w
Text = aZ.8 _
Found a
  starting at index 0 and ending at index 1
Found Z
  starting at index 1 and ending at index 2
Found 8
  starting at index 3 and ending at index 4
Found _
  starting at index 5 and ending at index 6
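
For readers who do not have RegexDemo at hand, the same matches can be reproduced with Pattern and Matcher directly. This is a rough sketch that assumes RegexDemo simply loops over Matcher.find(). The class name WordCharDemo and the method describeMatches are hypothetical and are not the article's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for RegexDemo, written only to illustrate \w.
class WordCharDemo {
    // Returns one two-line entry per match, formatted like the output above.
    static List<String> describeMatches(String regex, String text) {
        List<String> entries = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            entries.add("Found " + m.group() + "\n  starting at index "
                        + m.start() + " and ending at index " + m.end());
        }
        return entries;
    }

    public static void main(String[] args) {
        // The regex is written as "\\w" because Java string literals
        // need the backslash escaped.
        String regex = "\\w";
        String text = "aZ.8 _";
        System.out.println("Regex = " + regex);
        System.out.println("Text = " + text);
        for (String entry : describeMatches(regex, text)) {
            System.out.println(entry);
        }
    }
}
```

Note that the ending index is exclusive: Matcher.end() returns the offset just past the last matched character, which is why a one-character match at index 0 ends at index 1.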

Note
Pattern's SDK documentation refers to the period metacharacter as a predefined character class. It matches any character except a line terminator (a one- or two-character sequence identifying the end of a text line), unless dotall mode (discussed later) is in effect. Pattern recognizes the following line terminators:
  • The carriage-return character (\r)
  • The new-line (line feed) character (\n)
  • The carriage-return character immediately followed by the new-line character (\r\n)
  • The next-line character (\u0085)
  • The line-separator character (\u2028)
  • The paragraph-separator character (\u2029)
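
To see the period skipping these line terminators, consider the following small sketch. It assumes Pattern's default mode (no flags set). The class name DotLineTerminatorDemo and the method charsMatchedByDot are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, introduced only for illustration.
class DotLineTerminatorDemo {
    // Concatenates every character that the period metacharacter matches.
    static String charsMatchedByDot(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(".").matcher(text);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Letters separated by each of the six line terminators listed above.
        String text = "a\rb\nc\r\nd\u0085e\u2028f\u2029g";
        System.out.println(charsMatchedByDot(text)); // prints abcdefg
    }
}
```

Only the letters survive, because the period matches none of the line-terminator characters in default mode.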
