Pitfall 2: A misleading StringTokenizer parameter
This pitfall, also a result of poor naming conventions, revealed itself when a junior developer needed to parse a text file that used a three-character delimiter (his was the string ###) between tokens. In his first attempt, he used the StringTokenizer class to parse the input text. He sought my advice when he discovered what he considered to be strange behavior. The applet output below demonstrates code similar to his.
input: 123###4#5###678###hello###wo#rld###9
delim: ###
If '###' treated as a group delimiter expecting 6 tokens...
tok[0]: 123
tok[1]: 4
tok[2]: 5
tok[3]: 678
tok[4]: hello
tok[5]: wo
tok[6]: rld
tok[7]: 9
# of tokens: 8
The developer expected six tokens, but whenever a single # character appeared inside a token, he received more. He wanted the delimiter to be the group of three # characters, not a single # character.
Here is the key code used to parse the input string into an array of tokens:
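// Imports needed by the enclosing class.
import java.util.StringTokenizer;
import java.util.Vector;

// Splits input into tokens; returns null when no tokens are found.
public static String[] tokenize(String input, String delimiter)
{
    Vector v = new Vector();
    StringTokenizer t = new StringTokenizer(input, delimiter);
    String cmd[] = null;
    // Collect the tokens in order.
    while (t.hasMoreTokens())
        v.addElement(t.nextToken());
    int cnt = v.size();
    if (cnt > 0)
    {
        cmd = new String[cnt];
        v.copyInto(cmd);
    }
    return cmd;
}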
The tokenize() method is a wrapper for the StringTokenizer class. The StringTokenizer constructor takes two String arguments: one for the input and one for the delimiter. The junior developer incorrectly inferred that the delimiter parameter would be treated as a group of characters, not a set of single characters. I don't think that's such a poor assumption. With thousands of classes in the Java APIs, the burden of design simplicity rests on the designer's shoulders, not the application developer's. It is reasonable to assume that a String would be treated as a single group. After all, a String commonly represents a related grouping of characters.
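The set-of-characters behavior described above can be seen in a small sketch. The class name DelimiterSetDemo and the helper tokens() are hypothetical names chosen for illustration; only StringTokenizer itself comes from the article. Because each character of the delimiter String is an independent delimiter, passing "###" should behave exactly like passing "#".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; the name and helper are illustrative only.
class DelimiterSetDemo {

    // Collects every token StringTokenizer produces for the given delimiter String.
    static List<String> tokens(String input, String delimiters) {
        List<String> result = new ArrayList<>();
        StringTokenizer t = new StringTokenizer(input, delimiters);
        while (t.hasMoreTokens()) {
            result.add(t.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "123###4#5###678###hello###wo#rld###9";
        // Each character of the delimiter String is a separate delimiter,
        // so "###" and "#" describe the same one-character set.
        System.out.println(tokens(input, "###")); // [123, 4, 5, 678, hello, wo, rld, 9]
        System.out.println(tokens(input, "#"));   // [123, 4, 5, 678, hello, wo, rld, 9]
        // Two distinct delimiter characters: ',' and ';'.
        System.out.println(tokens("a,b;c", ",;")); // [a, b, c]
    }
}
```

The last call shows the case the constructor was designed for: several different single-character delimiters supplied in one String.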
A more correct StringTokenizer constructor would require the developer to provide an array of characters, which would make it clear that the delimiters in the current implementation of StringTokenizer are only single characters, though you can specify more than one. This particular API designer was more concerned with his implementation's rapid development than its intuitiveness.
To fix the problem, we create two new static tokenize() methods: one that takes an array of characters as delimiters, and one that accepts a boolean flag to signify whether the String delimiter should be regarded as a single group. Here is the code for those two methods:
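// String tokenizer with current behavior: each char is a separate delimiter
public static String[] tokenize(String input, char[] delimiters)
{
    return tokenize(input, new String(delimiters), false);
}

// When delimiterAsGroup is true, the whole delimiters String is one delimiter.
// Returns null when no tokens are found.
public static String[] tokenize(String input, String delimiters,
                                boolean delimiterAsGroup)
{
    Vector v = new Vector();
    String toks[] = null;
    if (!delimiterAsGroup)
    {
        StringTokenizer t = new StringTokenizer(input, delimiters);
        while (t.hasMoreTokens())
            v.addElement(t.nextToken());
    }
    else
    {
        // An empty group delimiter would never advance start, so reject it.
        if (delimiters.length() == 0)
            throw new IllegalArgumentException("empty group delimiter");
        // Unlike StringTokenizer, leading or adjacent delimiters yield empty tokens.
        int start = 0;
        int end = input.length();
        while (start < end)
        {
            int delimIdx = input.indexOf(delimiters, start);
            if (delimIdx < 0)
            {
                // No more delimiters: the rest of the input is the last token.
                String tok = input.substring(start);
                v.addElement(tok);
                start = end;
            }
            else
            {
                // The token runs up to the delimiter; resume just past it.
                String tok = input.substring(start, delimIdx);
                v.addElement(tok);
                start = delimIdx + delimiters.length();
            }
        }
    }
    int cnt = v.size();
    if (cnt > 0)
    {
        toks = new String[cnt];
        v.copyInto(toks);
    }
    return toks;
}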
Below is the output of an applet demonstrating the new static method, tokenize(), that treats the delimiter String ### as a single delimiter.
input: 123###4#5###678###hello###wo#rld###9
delim: ###
If '###' treated as a group delimiter expecting 6 tokens...
tok[0]: 123
tok[1]: 4#5
tok[2]: 678
tok[3]: hello
tok[4]: wo#rld
tok[5]: 9
# of tokens: 6
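For comparison only, and not as the method used in this article: on Java versions that provide String.split() and Pattern.quote(), a similar group-delimiter result can be sketched in a few lines. The class name SplitOnGroupDemo and the helper splitOnGroup() are hypothetical. Pattern.quote() is used so that the delimiter is matched literally, even if it contains regular-expression metacharacters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; an alternative sketch, not the article's code.
class SplitOnGroupDemo {

    // Splits on the whole delimiter String, quoted so it is matched literally.
    static String[] splitOnGroup(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String[] toks = splitOnGroup("123###4#5###678###hello###wo#rld###9", "###");
        for (int i = 0; i < toks.length; i++) {
            System.out.println("tok[" + i + "]: " + toks[i]);
        }
        System.out.println("# of tokens: " + toks.length); // # of tokens: 6
    }
}
```

Edge cases are handled differently from the hand-written loop. For example, split() drops trailing empty strings and returns a one-element array rather than null when the delimiter never occurs, so the two approaches are not drop-in replacements for each other.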
While some may consider the above pitfall relatively harmless, the next is extremely dangerous and should be seriously considered in any Java development project.