Splitting the Atom is Hard, Splitting Strings is Even Harder
I always seem to underestimate how much readers of this blog enjoy a good coding challenge. A few days ago, one of my coworkers was tripped by a line of code and I thought I'd share his confusion with people following me on Twitter, which turned out to be a fitting medium for this micro challenge:
Pop quiz: "abc.def".split(".").length should return...?
I really didn't think much of it but I got much more responses than I anticipated. A lot of them were incorrect (hey, if I throw a challenge, there has to be a trick), but I still want to congratulate everyone for playing the game and answering without testing the code first. That's the spirit!
A few people saw the trap, and interestingly, they pointed out that I must have made a mistake in the question instead of just giving the correct answer (“You must have meant split("\\.")”).
As hinted above, the trick here is that the parameter to java.lang.String#split is a regular expression, not a string. Since the dot matches all the characters in the given string and that this method cannot return any character that matches the separator, it returns an empty string array, so the answer is "0".
This didn't stop a few people from insisting that the answer is "2", which made me realize that the code snippet above happens to be valid Ruby code, a pretty unlikely coincidence. So to all of you who answered 2 because you thought this was Ruby, you were right. To the others, you are still wrong, sorry :-)
The bottom line is that this method is terribly designed. First of all, the string confusion is pretty common (I've been tricked by it more than a few times myself). A much better design would be to have two overloaded methods, one that takes a String (a real one) and one that takes a regular expression (a class that, unfortunately, doesn't exist in Java, so what you really want is a java.util.regex.Pattern).
API clarity is not the only benefit of this approach, there is also the performance aspect.
With the current signature, each call to split() causes the regular expression to be recompiled, as the source sadly confirms:
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
Since it's not uncommon to have such parsing code in a loop, this can
become a significant performance bottleneck. On the other hand, if we
had an overloaded version of split() that accepts a Pattern, it would be possible to precompile this pattern outside the loop.
Interestingly, that's how Ruby implements split():
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.But there is a catch: since Ruby is dynamically typed (which means it doesn't support overloading either), the determination of the class of the parameter has to be done at runtime, and even though this particular method is implemented in C, there is still an unavoidable performance toll to be paid, which is unfortunate for such a core method.If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters.
The conclusion of this little drill is that if you are going to split strings repeatedly, you are better off compiling the regular expression yourself and use Pattern#split instead of String#split.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)






Comments
Vassil Dichev replied on Mon, 2009/07/06 - 5:03am
Funny, I was tripped by this just 2 days ago. Actually, in Scala there's a split method in RichString which operates on a char. Granted, it still doesn't seem too obvious, especially to newcomers from other languages, but you can distinguish which version you want to use at compile time. Try these at http://www.simplyscala.com:
"abc.def".split(".").length"abc.def".split('.').length
Jose Maria Arranz replied on Mon, 2009/07/06 - 6:20am
Another annoying behavior of this method is the following:
System.out.println( ",Hi,Bye,".split(",").length );
Returns 3
The first empty string in the beginning counts but not the latter.
Michael Parmeley replied on Mon, 2009/07/06 - 1:12pm
in response to:
Jose Maria Arranz
Jose, you need to read the JavaDoc for the split method that takes a limit parameter. Adjusting your limit will get you the result you want for your use case. http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#split%28java.lang.String,%20int%29
You can the results you want by using a negative limit, i.e.:
System.out.println( ",Hi,Bye,".split(",", -1).length );The article as a whole seemed to just regurgitate the JavaDoc for the split() method. I don't think there would be any surprises here for anyone that has read the split() JavaDoc and is even slightly familiar with RegExp (in all RegExp flavors I am aware of the "." is a wildcard for any character).
Jilles Van Gurp replied on Mon, 2009/07/06 - 1:58pm
I actually found a misguided use of split today and pointed out the performance penalty to someone :-).
The regex usage in java's String class is downright irresponsble. A few weeks ago I found another nice bottleneck in our code: one of my colleagues had used replaceAll ... in a four times nested loop iterating over hundreds of thousands of objects (outer loop)! Essentially it was being called tens of millions of times with a handful of simple patterns. Our batch processing went from 50 minutes to 5+ hours. I found this one with a profiler. Basically the program was spending 90% of its time compiling the same regular expressions over and over again.
That's extreme of course and that's why I found it. However, I think this is a very common performance bug that can go unnoticed for a very long time. A couple of thousands replaceall's here, a couple of hundred splits there. It adds up. Wouldn't surprise me to find this in some high profile OSS projects.
The fix was easy: I created a StringReplacer class with a Map<String, Pattern> map. Each Pattern only compiled once and we went back to 50 minutes. In my view it is almost always a good idea to pool Pattern instances. The only exception I can think of is when your patterns are somehow input derived, in which case you could run out of memory doing this. Otherwise, compile once, use many times. Same goes for xpath expressions that generally are many times more expensive to compile than to execute. Those are a bit more tricky (not threadsafe, unlike Pattern) but same principle can be applied.
BTW. If anybody involved with Findbugs is reading: this would be a nice one to start catching.
eyveyvh cetrhexyh replied on Fri, 2009/07/10 - 4:49am
This would do the trick:
org.apache.commons.lang.StringUtils.split("abc.def", ".");I've noticed that migrating to commons-lang StringUtils gives quite a nice performance boost. It's because StringUtils doesn't use regexps.