Monday, December 26, 2011

Java's character and assorted string classes support text-processing - 26


Token extraction
StringTokenizer provides four methods for extracting tokens: public int countTokens(), public boolean hasMoreTokens(), public String nextToken(), and public String nextToken(String delim). The countTokens() method returns the number of tokens that remain to be extracted, calculated with the current delimiter set. Use this return value to estimate how many tokens you can extract. However, you should call hasMoreTokens() to determine when to stop tokenizing, because the value that countTokens() returns can become stale if the delimiter set changes (for example, through nextToken(String delim)). hasMoreTokens() returns true if at least one more token remains to extract; otherwise, it returns false. Finally, the nextToken() and nextToken(String delim) methods return the string's next token. If no more tokens are available, either method throws a NoSuchElementException. The two methods differ only in that nextToken(String delim) first replaces the StringTokenizer's delimiter characters with the characters in the delim-referenced String, and the new delimiter set stays in effect for subsequent calls. Given this information, the following code, which builds on the previous fragment, shows how to use the previous three StringTokenizers to extract a string's tokens:
// stok1: default delimiter set
System.out.println ("count1 = " + stok1.countTokens ());
while (stok1.hasMoreTokens ())
   System.out.println ("token = " + stok1.nextToken ());
// stok2: the vertical bar is the only delimiter
System.out.println ("\r\ncount2 = " + stok2.countTokens ());
while (stok2.hasMoreTokens ())
   System.out.println ("token = " + stok2.nextToken ());
// stok3: space and vertical bar delimit, and delimiters return as tokens
System.out.println ("\r\ncount3 = " + stok3.countTokens ());
while (stok3.hasMoreTokens ())
   System.out.println ("token = " + stok3.nextToken ());

The fragment above divides into three parts. The first part focuses on stok1. After retrieving and printing a token count, a while loop calls nextToken() to extract tokens for as long as hasMoreTokens() returns true. The second and third parts use identical logic for the other StringTokenizers. If you execute the code fragment, you should observe the following output:
count1 = 6
token = A
token = sentence
token = to
token = tokenize.|A
token = second
token = sentence.
count2 = 2
token = A sentence to tokenize.
token = A second sentence.
count3 = 13
token = A
token = 
token = sentence
token = 
token = to
token = 
token = tokenize.
token = |
token = A
token = 
token = second
token = 
token = sentence.

The output above reveals three different token counts for the same string. The counts differ because the sets of delimiters differ. For stok1, the default delimiter set applies (space, tab, newline, carriage return, and form feed), which is why tokenize.|A comes back as a single token. For stok2, only one delimiter is present: the vertical bar. stok3 uses a space and a vertical bar as its delimiters. The output's final portion shows that the space and vertical bar delimiters are themselves returned as tokens, because true was passed as the returnDelims argument to stok3's constructor.
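
For readers who do not have the earlier fragment at hand, here is a minimal self-contained sketch of the same experiment. The input string and the constructor arguments are assumptions reconstructed from the output and the explanation above, and the TokenDemo class with its tokens() and report() helpers is hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenDemo
{
   // Drains a tokenizer into a list, using hasMoreTokens() as the loop guard.
   static List<String> tokens (StringTokenizer stok)
   {
      List<String> result = new ArrayList<String> ();
      while (stok.hasMoreTokens ())
         result.add (stok.nextToken ());
      return result;
   }

   // Prints the token count first, then each extracted token.
   static void report (String label, StringTokenizer stok)
   {
      System.out.println (label + " = " + stok.countTokens ());
      for (String token : tokens (stok))
         System.out.println ("token = " + token);
   }

   public static void main (String [] args)
   {
      // Assumed input, reconstructed from the output shown above.
      String s = "A sentence to tokenize.|A second sentence.";

      // Default delimiters: space, tab, newline, carriage return, form feed.
      report ("count1", new StringTokenizer (s));
      // Only the vertical bar delimits.
      report ("count2", new StringTokenizer (s, "|"));
      // Space and vertical bar delimit, and delimiters come back as tokens.
      report ("count3", new StringTokenizer (s, " |", true));
   }
}
```

Under these assumptions the three counts should match the 6, 2, and 13 reported above.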

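One way a token count can mislead is when nextToken(String delim) changes the delimiters partway through. The sketch below illustrates that case. It reuses the same assumed input string, and the CountDemo class and its extract() helper are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class CountDemo
{
   // Extracts every token from s, switching delimiters after the first token.
   static List<String> extract (String s)
   {
      List<String> result = new ArrayList<String> ();
      StringTokenizer stok = new StringTokenizer (s, "|");

      // Counted against the "|" delimiter only.
      System.out.println ("count before switch = " + stok.countTokens ());
      result.add (stok.nextToken ());

      // From this call on, a space and a vertical bar are both delimiters.
      result.add (stok.nextToken (" |"));

      // Recounted against the new delimiter set.
      System.out.println ("count after switch = " + stok.countTokens ());

      // hasMoreTokens() always reflects the delimiters currently in effect.
      while (stok.hasMoreTokens ())
         result.add (stok.nextToken ());
      return result;
   }

   public static void main (String [] args)
   {
      for (String token : extract ("A sentence to tokenize.|A second sentence."))
         System.out.println ("token = " + token);
   }
}
```

With this input, the first countTokens() call reports 2, yet the method ends up extracting four tokens, because the count is computed against whatever delimiters are in effect at the moment of the call. Note that the new delimiter string includes the old vertical bar. Otherwise the bar at the current position would become part of the next token.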