Sunday, November 13, 2011

Regular expressions simplify pattern-matching code - 9


Capturing groups

Pattern supports a regex construct called a capturing group, which saves a match's characters for later recall during pattern matching. The construct is a character sequence surrounded by parentheses metacharacters (( )). All characters within a capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in text. Each new match replaces the characters saved by the previous match.
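As a minimal sketch of how the saved characters are replaced on each match, the following class collects what group 1 holds after every successful find(). The class name CapturingGroupSketch, the helper groupOneMatches, and the sample text are assumptions made for illustration; they are not part of RegexDemo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapturingGroupSketch {
    // Hypothetical helper: records group 1's saved text and its start index
    // after each successful match. The regex must contain at least one group.
    static List<String> groupOneMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> saved = new ArrayList<>();
        while (m.find()) {
            // group(1) now holds this match's characters, not the previous match's
            saved.add(m.group(1) + "@" + m.start(1));
        }
        return saved;
    }

    public static void main(String[] args) {
        // prints [Java@0, Java@14]
        System.out.println(groupOneMatches("(Java)", "Java and more Java"));
    }
}
```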
Capturing groups can nest inside other capturing groups. For example, in (Java( language)), ( language) nests inside (Java( language)). Each capturing group, nested or not, receives its own number. Numbering starts at 1 and proceeds from left to right in the order of each group's opening parenthesis. In the example, (Java( language)) is capturing group number 1, and ( language) is capturing group number 2. In (a)(b), (a) is capturing group number 1, and (b) is capturing group number 2.
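The numbering can be checked with a short sketch that lists each group's text after the first match. The class name GroupNumberingSketch, the helper groups, and the sample inputs are assumptions for illustration. Index 0 in the returned array is the whole match, which Matcher reports as group 0; capturing groups proper start at index 1.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumberingSketch {
    // Hypothetical helper: returns group 0 (the whole match) followed by each
    // numbered capturing group, or an empty array when nothing matches.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Nested groups: the outer group is number 1, the inner group is number 2.
        // prints [Java language, Java language,  language]
        System.out.println(Arrays.toString(groups("(Java( language))", "The Java language")));
        // Sibling groups: numbered left to right.
        // prints [ab, a, b]
        System.out.println(Arrays.toString(groups("(a)(b)", "ab")));
    }
}
```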
Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:
java RegexDemo "(Java( language)\2)" "The Java language language"

The example uses the (Java( language)\2) regex to search the text The Java language language for a grammatical error in which Java immediately precedes two consecutive occurrences of language. That regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2's saved match, so the matcher looks for a second occurrence of a space character followed by language immediately after the first. The following output shows what RegexDemo's matcher finds:
Regex = (Java( language)\2)
Text = The Java language language
Found Java language language
  starting at index 4 and ending at index 26
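
The same search can be reproduced directly against Pattern and Matcher. This is a minimal sketch, not RegexDemo's source: the class name BackReferenceSketch and the helper firstMatch are assumptions for illustration. Note that the backslash in \2 must be doubled inside a Java string literal, and that the reported end index is the offset just past the last matched character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackReferenceSketch {
    // Hypothetical helper: returns {start, end} of the first match,
    // or null when the regex does not match anywhere in the text.
    static int[] firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        // \\2 in the literal becomes \2 in the regex: a back reference to group 2
        String regex = "(Java( language)\\2)";

        int[] span = firstMatch(regex, "The Java language language");
        System.out.println(span[0] + " to " + span[1]); // prints 4 to 26

        // Without the repeated word there is nothing for \2 to match.
        System.out.println(firstMatch(regex, "The Java language") == null); // prints true
    }
}
```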
