Tuesday, November 29, 2011

Regular expressions simplify pattern-matching code - 25


A practical application of regexes

Regexes let you create powerful text-processing applications. One application you might find helpful extracts comments from a Java, C, or C++ source file, and records those comments in another file. Listing 2 presents that application's source code:
Listing 2. ExtCmnt.java
// ExtCmnt.java
import java.io.*;
import java.util.regex.*;
class ExtCmnt
{
   public static void main (String [] args)
   {
      if (args.length != 2)
      {
          System.err.println ("usage: java ExtCmnt infile outfile");
          return;
      }
      Pattern p;
      try
      {
         // The following pattern lets this extract multiline comments that
         // appear on a single line (e.g., /* same line */) and single-line
         // comments (e.g., // some line). Furthermore, the comment may
         // appear anywhere on the line.
         p = Pattern.compile (".*/\\*.*\\*/|.*//.*$");
      }
      catch (PatternSyntaxException e)
      {
         System.err.println ("Regex syntax error: " + e.getMessage ());
         System.err.println ("Error description: " + e.getDescription ());
         System.err.println ("Error index: " + e.getIndex ());
         System.err.println ("Erroneous pattern: " + e.getPattern ());
         return;
      }
      BufferedReader br = null;
      BufferedWriter bw = null;
      try
      {
          FileReader fr = new FileReader (args [0]);
          br = new BufferedReader (fr);
          FileWriter fw = new FileWriter (args [1]);
          bw = new BufferedWriter (fw);
          Matcher m = p.matcher ("");
          String line;
          while ((line = br.readLine ()) != null)
          {
             m.reset (line);
             if (m.matches ()) /* entire line must match */
             {
                 bw.write (line);
                 bw.newLine ();
             }
          }
      }
      catch (IOException e)
      {
          System.err.println (e.getMessage ());
          return;
      }
      finally // Close file.
      {
          try
          {
              if (br != null)
                  br.close ();
              if (bw != null)
                  bw.close ();
          }
          catch (IOException e)
          {
          }
      }
   }
}

After creating Pattern and Matcher objects, ExtCmnt reads a text file's contents line by line. For each line it reads, the matcher attempts to match the entire line against the pattern, which identifies either a single-line comment or a multiline-style comment that begins and ends on the same line. If the line matches the pattern, ExtCmnt writes that line to another text file. For example, java ExtCmnt ExtCmnt.java out reads each line of ExtCmnt.java, attempts to match that line against the pattern, and writes matched lines to a file named out. (Don't worry about understanding the file reading and writing logic. I will explore that logic in a future article.) After ExtCmnt completes, out contains the following lines:
// ExtCmnt.java
         // The following pattern lets this extract multiline comments that
         // appear on a single line (e.g., /* same line */) and single-line
         // comments (e.g., // some line). Furthermore, the comment may
         // appear anywhere on the line.
         p = Pattern.compile (".*/\\*.*\\*/|.*//.*$");
             if (m.matches ()) /* entire line must match */
      finally // Close file.

The output shows that ExtCmnt is not perfect: p = Pattern.compile (".*/\\*.*\\*/|.*//.*$"); is not a comment. That line appears in out because the pattern's second alternative, .*//.*$, matches the // characters inside the string literal.
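To see the false positive in isolation, the following minimal sketch applies the same regex to a few sample lines without any file handling. The class name MatchCheck, the method isExtracted, and the sample lines are assumptions made for this illustration; they are not part of ExtCmnt.

```java
import java.util.regex.Pattern;

class MatchCheck
{
   // Same regex as Listing 2.
   static final Pattern P = Pattern.compile (".*/\\*.*\\*/|.*//.*$");

   // Returns true when the entire line matches the pattern, which is
   // the condition ExtCmnt uses before writing a line to the output.
   static boolean isExtracted (String line)
   {
      return P.matcher (line).matches ();
   }

   public static void main (String [] args)
   {
      String [] lines =
      {
         "// ExtCmnt.java",
         "if (m.matches ()) /* entire line must match */",
         // The Pattern.compile line from Listing 2, escaped as a literal.
         "p = Pattern.compile (\".*/\\\\*.*\\\\*/|.*//.*$\");",
         "/* note */ int y;",
         "int x = 0;"
      };
      for (String line : lines)
         System.out.println (isExtracted (line) + " : " + line);
   }
}
```

The first three lines are reported as matches, including the Pattern.compile line that is not a comment. The last two are not. The fourth sample hints at another limitation: because matches() requires the whole line to match and the first alternative ends with */, a /* ... */ comment followed by code on the same line is skipped.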
The pattern ".*/\\*.*\\*/|.*//.*$" contains something interesting: the vertical bar metacharacter (|). According to the SDK documentation, the parentheses metacharacters in a capturing group and the vertical bar metacharacter are logical operators. The vertical bar tells a matcher to first use the regex construct on the operator's left side to locate a match in the matcher's text. If that attempt finds no match, the matcher makes another attempt with the regex construct on the operator's right side.
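The left-then-right behavior can be observed by wrapping each alternative in its own capturing group and checking which group took part in the match. This is only a sketch: the groups, the class name AltDemo, and the method whichAlternative are additions for illustration and do not appear in Listing 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AltDemo
{
   // Listing 2's two alternatives, each wrapped in a capturing group.
   // Group 1 is the left operand of |, group 2 is the right operand.
   static final Pattern P = Pattern.compile ("(.*/\\*.*\\*/)|(.*//.*$)");

   // Reports which side of the vertical bar matched the entire line.
   static String whichAlternative (String line)
   {
      Matcher m = P.matcher (line);
      if (!m.matches ())
         return "none";
      // A group that did not take part in the match returns null.
      return m.group (1) != null ? "left" : "right";
   }

   public static void main (String [] args)
   {
      String [] lines =
      {
         "x = 1; /* set x */",
         "x = 1; // set x",
         "// a /* b */",
         "x = 1;"
      };
      for (String line : lines)
         System.out.println (whichAlternative (line) + " : " + line);
   }
}
```

The third sample is the interesting one: it is a single-line comment, yet it is reported as a left match, because the left alternative is tried first and happens to succeed on the whole line.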
