Friday, November 18, 2011

Regular expressions simplify pattern-matching code - 14


The following command-line example uses the ^ boundary matcher metacharacter to ensure that a line begins with The followed by zero or more word characters:
java RegexDemo ^The\w* Therefore

^ indicates that the first three text characters must match the pattern's subsequent T, h, and e characters. Any number of word characters may follow. The command line above produces the following output:
Regex = ^The\w*
Text = Therefore
Found Therefore
  starting at index 0 and ending at index 9

Change the command line to java RegexDemo ^The\w* " Therefore". What happens? No match is found because a space character precedes Therefore.
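The two command lines above can be reproduced without the command line. The following is a minimal sketch, not the RegexDemo source (which is not shown in this section): the class name BoundaryDemo and the helper firstMatch are hypothetical names chosen for illustration. Note that the backslash in \w must be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for RegexDemo, used only to illustrate the ^ boundary matcher.
class BoundaryDemo {
    // Describes the first match of regex in text, or reports that there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.find()) {
            return "Found " + m.group() + " starting at index " + m.start()
                    + " and ending at index " + m.end();
        }
        return "No match";
    }

    public static void main(String[] args) {
        // The text begins with "The", so ^ succeeds at index 0.
        System.out.println(firstMatch("^The\\w*", "Therefore"));
        // A leading space means "The" no longer starts the text, so ^ fails.
        System.out.println(firstMatch("^The\\w*", " Therefore"));
    }
}
```

The first call reports a match from index 0 to index 9; the second reports no match, because ^ anchors the pattern to the very start of the text.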

Embedded flag expressions

Matchers assume certain defaults, such as case-sensitive pattern matching. A program may override any default by using an embedded flag expression, that is, a regex construct specified as parentheses metacharacters surrounding a question mark metacharacter (?) followed by a specific lowercase letter. Pattern recognizes the following embedded flag expressions:
  • (?i): enables case-insensitive pattern matching. Example: java RegexDemo (?i)tree Treehouse matches tree with Tree. Case-sensitive pattern matching is the default.
  • (?x): permits whitespace and comments beginning with the # metacharacter to appear in a pattern. A matcher ignores both. Example: java RegexDemo ".at(?x)#match hat, cat, and so on" matter matches .at with mat. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
  • (?s): enables dotall mode. In that mode, the period metacharacter matches line terminators in addition to any other character. Example: java RegexDemo (?s). \n matches . with \n. Nondotall mode is the default: the period does not match line-terminator characters.
  • (?m): enables multiline mode. In multiline mode, ^ matches at the beginning of the text and just after a line terminator, and $ matches just before a line terminator and at the text's end. Example: java RegexDemo (?m)^.ake make\rlake\n\rtake matches .ake with make, lake, and take. Non-multiline mode is the default: ^ and $ match only at the beginning and end of the entire text.
  • (?u): enables Unicode-aware case folding. This flag works with (?i) to perform case-insensitive matching in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched.
  • (?d): enables Unix lines mode. In that mode, a matcher recognizes only the \n line terminator in the context of the ., ^, and $ metacharacters. Non-Unix lines mode is the default: a matcher recognizes all line terminators in the context of the aforementioned metacharacters.
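
The flags listed above can be compared with their defaults in a short program. This is a minimal sketch under stated assumptions: the class name FlagDemo and the helper matches are hypothetical and are not part of RegexDemo. The line terminators are written as Java escape sequences, and the accented characters for the (?u) case are written as Unicode escapes to avoid source-encoding issues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate embedded flag expressions.
class FlagDemo {
    // Collects every substring of text that matches regex.
    static List<String> matches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching finds Tree.
        System.out.println(matches("(?i)tree", "Treehouse"));
        // (?x): everything from # onward is a comment, leaving .at as the pattern.
        System.out.println(matches(".at(?x)#match hat, cat, and so on", "matter"));
        // (?s): the period matches a line terminator only in dotall mode.
        System.out.println(matches("(?s).", "\n").size() + " vs " + matches(".", "\n").size());
        // (?m): ^ also matches just after each line terminator.
        System.out.println(matches("(?m)^.ake", "make\rlake\n\rtake"));
        System.out.println(matches("^.ake", "make\rlake\n\rtake"));
        // (?u) with (?i): lowercase e-acute matches uppercase E-acute only with Unicode case folding.
        System.out.println(matches("(?iu)\u00e9", "\u00c9").size() + " vs "
                + matches("(?i)\u00e9", "\u00c9").size());
        // (?d): only \n counts as a line terminator, so the period matches \r.
        System.out.println(matches("(?d).", "\r").size() + " vs " + matches(".", "\r").size());
    }
}
```

Each "vs" line prints the match count with the flag first and the default behavior second, so the effect of the flag can be read off directly.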
