I am a software engineer at Google on the Android project and the creator of the Java testing framework TestNG. When I'm not updating this weblog with various software-related posts or speaking at conferences, I am busy snowboarding, playing squash, tennis, golf or volleyball or scuba diving. Cedric is a DZone MVB and is not an employee of DZone and has posted 90 posts at DZone. You can read more from them at their website. View Full User Profile

Splitting the Atom is Hard, Splitting Strings is Even Harder

07.06.2009
| 4650 views |
  • submit to reddit

I always seem to underestimate how much readers of this blog enjoy a good coding challenge. A few days ago, one of my coworkers was tripped by a line of code and I thought I'd share his confusion with people following me on Twitter, which turned out to be a fitting medium for this micro challenge:

Pop quiz: "abc.def".split(".").length should return...?

I really didn't think much of it but I got much more responses than I anticipated. A lot of them were incorrect (hey, if I throw a challenge, there has to be a trick), but I still want to congratulate everyone for playing the game and answering without testing the code first. That's the spirit!

A few people saw the trap, and interestingly, they pointed out that I must have made a mistake in the question instead of just giving the correct answer (“You must have meant split("\\.")”).

As hinted above, the trick here is that the parameter to java.lang.String#split is a regular expression, not a string. Since the dot matches all the characters in the given string and that this method cannot return any character that matches the separator, it returns an empty string array, so the answer is "0".

This didn't stop a few people from insisting that the answer is "2", which made me realize that the code snippet above happens to be valid Ruby code, a pretty unlikely coincidence. So to all of you who answered 2 because you thought this was Ruby, you were right. To the others, you are still wrong, sorry :-)

The bottom line is that this method is terribly designed. First of all, the string confusion is pretty common (I've been tricked by it more than a few times myself). A much better design would be to have two overloaded methods, one that takes a String (a real one) and one that takes a regular expression (a class that, unfortunately, doesn't exist in Java, so what you really want is a java.util.regex.Pattern).

API clarity is not the only benefit of this approach, there is also the performance aspect.

With the current signature, each call to split() causes the regular expression to be recompiled, as the source sadly confirms:

public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
Since it's not uncommon to have such parsing code in a loop, this can become a significant performance bottleneck. On the other hand, if we had an overloaded version of split() that accepts a Pattern, it would be possible to precompile this pattern outside the loop.

Interestingly, that's how Ruby implements split():

If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.

If pattern is a Regexp, str is divided where the pattern matches. Whenever the pattern matches a zero-length string, str is split into individual characters.

But there is a catch: since Ruby is dynamically typed (which means it doesn't support overloading either), the determination of the class of the parameter has to be done at runtime, and even though this particular method is implemented in C, there is still an unavoidable performance toll to be paid, which is unfortunate for such a core method.

The conclusion of this little drill is that if you are going to split strings repeatedly, you are better off compiling the regular expression yourself and use Pattern#split instead of String#split.

From http://beust.com/weblog

0
Your rating: None
Published at DZone with permission of Cedric Beust, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Vassil Dichev replied on Mon, 2009/07/06 - 5:03am

Funny, I was tripped by this just 2 days ago. Actually, in Scala there's a split method in RichString which operates on a char. Granted, it still doesn't seem too obvious, especially to newcomers from other languages, but you can distinguish which version you want to use at compile time. Try these at http://www.simplyscala.com:

"abc.def".split(".").length
"abc.def".split('.').length

Jose Maria Arranz replied on Mon, 2009/07/06 - 6:20am

Another annoying behavior of this method is the following:

        System.out.println( ",Hi,Bye,".split(",").length );

  Returns 3

  The first empty string in the beginning counts but not the latter.

 

 

Michael Parmeley replied on Mon, 2009/07/06 - 1:12pm in response to: Jose Maria Arranz

Jose, you need to read the JavaDoc for the split method that takes a limit parameter. Adjusting your limit will get you the result you want for your use case. http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#split%28java.lang.String,%20int%29

 You can the results you want by using a negative limit, i.e.:

 

System.out.println( ",Hi,Bye,".split(",", -1).length );

 

The article as a whole seemed to just regurgitate the JavaDoc for the split() method. I don't think there would be any surprises here for anyone that has read the split() JavaDoc and is even slightly familiar with RegExp (in all RegExp flavors I am aware of the "." is a wildcard for any character).

Jilles Van Gurp replied on Mon, 2009/07/06 - 1:58pm

I actually found a misguided use of split today and pointed out the performance penalty to someone :-).

The regex usage in java's String class is downright irresponsble. A few weeks ago I found another nice bottleneck in our code: one of my colleagues had used replaceAll ... in a four times nested loop iterating over hundreds of thousands of objects (outer loop)! Essentially it was being called tens of millions of times with a handful of simple patterns. Our batch processing went from 50 minutes to 5+ hours. I found this one with a profiler. Basically the program was spending 90% of its time compiling the same regular expressions over and over again.

That's extreme of course and that's why I found it. However, I think this is a very common performance bug that can go unnoticed for a very long time. A couple of thousands replaceall's here, a couple of hundred splits there. It adds up. Wouldn't surprise me to find this in some high profile OSS projects.

The fix was easy: I created a StringReplacer class with a Map<String, Pattern> map. Each Pattern only compiled once and we went back to 50 minutes. In my view it is almost always a good idea to pool Pattern instances. The only exception I can think of is when your patterns are somehow input derived, in which case you could run out of memory doing this. Otherwise, compile once, use many times. Same goes for xpath expressions that generally are many times more expensive to compile than to execute. Those are a bit more tricky (not threadsafe, unlike Pattern) but same principle can be applied.

BTW. If anybody involved with Findbugs is reading: this would be a nice one to start catching.

eyveyvh cetrhexyh replied on Fri, 2009/07/10 - 4:49am

This would do the trick:

org.apache.commons.lang.StringUtils.split("abc.def", ".");

I've noticed that migrating to commons-lang StringUtils gives quite a nice performance boost. It's because StringUtils doesn't use regexps.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.