Saturday, October 29, 2011

Steer clear of Java pitfalls - 2


Pitfall 2: A misleading StringTokenizer parameter

This pitfall, also a result of poor naming conventions, revealed itself when a junior developer needed to parse a text file that used a three-character delimiter (his was the string ###) between tokens. In his first attempt, he used the StringTokenizer class to parse the input text. He sought my advice when he discovered what he considered to be strange behavior. The output below comes from code similar to his.
input: 123###4#5###678###hello###wo#rld###9
delim: ###
If '###' treated as a group delimiter expecting 6 tokens...
tok[0]: 123
tok[1]: 4
tok[2]: 5
tok[3]: 678
tok[4]: hello
tok[5]: wo
tok[6]: rld
tok[7]: 9
# of tokens: 8
The developer expected six tokens, but if a single # character was present in any token, he received more. He wanted the delimiter to be the group of three # characters, not a single # character.
Here is the key code used to parse the input string into an array of tokens:
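    // Requires: import java.util.StringTokenizer; import java.util.Vector;
    // Returns the tokens of input, or null if there are none.
    public static String[] tokenize(String input, String delimiter)
    {
        Vector<String> v = new Vector<String>();
        // Each character of delimiter is treated as a separate delimiter.
        StringTokenizer t = new StringTokenizer(input, delimiter);
        String[] cmd = null;
        while (t.hasMoreTokens())
            v.addElement(t.nextToken());

        int cnt = v.size();
        if (cnt > 0)
        {
            cmd = new String[cnt];
            v.copyInto(cmd);
        }
        return cmd;
    }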
The tokenize() method is a wrapper for the StringTokenizer class. The StringTokenizer constructor takes two String arguments: one for the input and one for the delimiter. The junior developer incorrectly inferred that the delimiter parameter would be treated as a group of characters, not a set of single characters. I don't think that's such a poor assumption. With thousands of classes in the Java APIs, the burden of design simplicity rests on the designer's shoulders, not the application developer's. It is reasonable to assume that a String would be treated as a single group. After all, a String commonly represents a related grouping of characters.
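To make the "set of single characters" behavior concrete, here is a minimal sketch. The class name DelimiterSetDemo and the helper tokens() are hypothetical names used only for this illustration; the StringTokenizer calls are the standard java.util API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterSetDemo {
    // Hypothetical helper: collects every token StringTokenizer produces
    // for the given delimiter string.
    static List<String> tokens(String input, String delimiters) {
        List<String> out = new ArrayList<String>();
        StringTokenizer t = new StringTokenizer(input, delimiters);
        while (t.hasMoreTokens()) {
            out.add(t.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // ",;" means "split on ',' or on ';'", not "split on the sequence ,;".
        System.out.println(tokens("a,b;c,;d", ",;"));   // [a, b, c, d]
        // "###" is just the one-character set {'#'} written three times.
        System.out.println(tokens("4#5###678", "###")); // [4, 5, 678]
    }
}
```

Note that runs of adjacent delimiter characters are skipped as a unit, which is why the three # characters in the second call do not produce empty tokens.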
A clearer StringTokenizer constructor would require the developer to provide an array of characters, which would make it obvious that the delimiters in the current implementation of StringTokenizer are only single characters (though you can specify more than one). This particular API designer was more concerned with his implementation's rapid development than its intuitiveness.
To fix the problem, we create two new static tokenize() methods: one that takes an array of characters as delimiters, and one that accepts a boolean flag to signify whether the String delimiter should be regarded as a single group. Here is the code for those two methods:

    // String tokenizer with current behavior
    public static String [] tokenize(String input, char [] delimiters)
    {
        return tokenize(input, new String(delimiters), false);
    }
    public static String [] tokenize(String input, String delimiters,
               boolean delimiterAsGroup)
    {
        Vector v = new Vector();
        String toks[] = null;
        if (!delimiterAsGroup)
        {
            StringTokenizer t = new StringTokenizer(input, delimiters);
            while (t.hasMoreTokens())
                v.addElement(t.nextToken());
        }
        else
        {
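            // An empty group delimiter would match at every index and never
            // advance start, so reject it rather than loop forever.
            if (delimiters.length() == 0)
                throw new IllegalArgumentException("delimiter must not be empty");
            int start = 0;
            int end = input.length();
            while (start < end)
            {
                int delimIdx = input.indexOf(delimiters, start);
                if (delimIdx < 0)
                {
                    // No more delimiters: the rest of the input is the last token.
                    String tok = input.substring(start);
                    v.addElement(tok);
                    start = end;
                }
                else
                {
                    // Token runs up to the delimiter; resume just past it.
                    String tok = input.substring(start, delimIdx);
                    v.addElement(tok);
                    start = delimIdx + delimiters.length();
                }
            }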
        }
        int cnt = v.size();
        if (cnt > 0)
        {
            toks = new String[cnt];
            v.copyInto(toks);
        }
       
        return toks;
    }
Below is the output of the new static tokenize() method when it treats the String ### as a single delimiter.
input: 123###4#5###678###hello###wo#rld###9
delim: ###
If '###' treated as a group delimiter expecting 6 tokens...
tok[0]: 123
tok[1]: 4#5
tok[2]: 678
tok[3]: hello
tok[4]: wo#rld
tok[5]: 9
# of tokens: 6
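On Java 1.4 and later, a similar result can be had without a hand-written loop. The sketch below is an alternative technique, not the method shown above: it uses String.split() with Pattern.quote() so the delimiter is matched as one literal sequence. The class name GroupSplitDemo and the helper splitOnGroup() are hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class GroupSplitDemo {
    // Hypothetical helper: splits on the delimiter taken as one literal sequence.
    // Pattern.quote() stops regex metacharacters in the delimiter
    // (such as '.' or '|') from being interpreted as regex syntax.
    static String[] splitOnGroup(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String input = "123###4#5###678###hello###wo#rld###9";
        System.out.println(Arrays.toString(splitOnGroup(input, "###")));
        // [123, 4#5, 678, hello, wo#rld, 9]
    }
}
```

Edge cases differ from the tokenize() method above. For example, split() discards trailing empty strings and returns an array rather than null when there is nothing to split, so it is worth checking the behavior against your own input before swapping one for the other.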
While some may consider the above pitfall relatively harmless, the next is extremely dangerous and should be seriously considered in any Java development project.
