Guava Splitter vs StringUtils
I recently wrote a post about good old reliable Apache Commons StringUtils, which provoked a couple of comments, one of which was that Google Guava provides better mechanisms for joining and splitting Strings. I have to admit, this is a corner of Guava I've yet to explore, so I thought I ought to take a closer look and compare it with StringUtils. I was surprised at what I found.
Splitting strings, eh? There can't be many different ways of doing this, surely?
Well, Guava and StringUtils do take stylistically different approaches. Let's start with the basic usage.
// Apache StringUtils...
String[] tokens1 = StringUtils.split("one,two,three", ',');
// Guava splitter...
Iterable<String> tokens2 = Splitter.on(',').split("one,two,three");
So, my first observation is that Splitter is more object-oriented. You have to create a splitter object, which you then use to do the splitting, whereas the StringUtils split methods use a more functional style, with static methods.
Here I much prefer Splitter. Need a reusable splitter that splits comma-separated lists? A splitter that also trims leading and trailing whitespace, and ignores empty elements? Not a problem:
Splitter niceCommaSplitter = Splitter.on(',')
    .omitEmptyStrings()
    .trimResults();
niceCommaSplitter.split("one,, two, three"); // "one", "two", "three"
niceCommaSplitter.split(" four , five "); // "four", "five"
That looks really useful, any other differences?
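(As an aside, the trimming and empty-element handling shown above can be approximated with the JDK alone. The following is only a rough sketch, assuming Java 8 or later; the class and method names are made up for illustration, and it is not how Guava does it internally.)

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JdkCommaSplit {
    // Hypothetical helper: split on commas, trim each token, drop empty tokens.
    static List<String> splitNicely(String input) {
        return Arrays.stream(input.split(","))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitNicely("one,, two, three")); // [one, two, three]
        System.out.println(splitNicely(" four , five "));    // [four, five]
    }
}
```

Unlike a configured Splitter, this version does all of its work eagerly and builds a full List every time it is called.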
The other thing to notice is that Splitter returns an Iterable<String>, whereas StringUtils.split returns a String array.
Don't really see that making much of a difference, most of the time I just want to loop through the tokens in order anyway!
I also didn't think it was a big deal, until I examined the performance of the two approaches. To do this I tried running the following code:
final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    StringUtils.split(numberList, ',');
}
System.out.println(System.currentTimeMillis() - start);
start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    Splitter.on(',').split(numberList);
}
System.out.println(System.currentTimeMillis() - start);
On my machine this output the following times:
594
31
Guava's Splitter is over 10 times faster!
Now this is a much bigger difference than I was expecting: Splitter is over 10 times faster than StringUtils. How can this be? Well, I suspect it's something to do with the return type. Splitter returns an Iterable<String>, whereas StringUtils.split gives you an array of Strings! So Splitter doesn't actually need to create the new String objects up front.
It's also worth noting you can cache your Splitter object, which results in an even faster runtime.
Blimey, end of argument? Guava's Splitter wins every time?
Hold on a second. This isn't quite the full story. Notice we're not actually doing anything with the result of the split? Like I mentioned, it looks like the Splitter isn't actually creating any new Strings. I suspect it's actually deferring this to the Iterator object it returns.
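To make that suspicion concrete, here is a minimal sketch of a splitter that defers its work to the Iterator. It is an illustration only, not Guava's actual implementation; the class name, the lazySplit method and the substringsCreated counter are all invented for this example.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class LazySplitSketch {
    // Counts substring() calls so we can see when the work actually happens.
    static int substringsCreated = 0;

    // Hypothetical lazy splitter: returns immediately, tokens are cut out during iteration.
    static Iterable<String> lazySplit(final String input, final char separator) {
        return () -> new Iterator<String>() {
            private int position = 0;        // start index of the next token
            private boolean finished = false;

            @Override
            public boolean hasNext() {
                return !finished;
            }

            @Override
            public String next() {
                if (finished) {
                    throw new NoSuchElementException();
                }
                int end = input.indexOf(separator, position);
                if (end < 0) {               // last token runs to the end of the input
                    end = input.length();
                    finished = true;
                }
                String token = input.substring(position, end);
                substringsCreated++;
                position = end + 1;
                return token;
            }
        };
    }

    public static void main(String[] args) {
        Iterable<String> tokens = lazySplit("One,Two,Three", ',');
        System.out.println(substringsCreated); // 0: nothing has been split yet
        for (String token : tokens) {
            System.out.println(token);
        }
        System.out.println(substringsCreated); // 3: the work happened during iteration
    }
}
```

Calling lazySplit in a loop without iterating the result costs almost nothing, which is exactly the situation a timing loop that ignores the returned value would measure.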
So can we test this?
Sure thing. Here's some code to repeatedly check the lengths of the generated substrings:
final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    final String[] numbers = StringUtils.split(numberList, ',');
    for (String number : numbers) {
        number.length();
    }
}
System.out.println(System.currentTimeMillis() - start);
Splitter splitter = Splitter.on(',');
start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    Iterable<String> numbers = splitter.split(numberList);
    for (String number : numbers) {
        number.length();
    }
}
System.out.println(System.currentTimeMillis() - start);
On my machine this outputs:
609
2048
Guava's Splitter is almost 4 times slower!
Indeed, I was expecting them to be about the same, or maybe Guava slightly faster, so this is another surprising result. It looks like, by returning an Iterable, Splitter is trading immediate gains for longer-term pain. There's also a moral here about making sure performance tests are actually testing something useful.
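One way to act on that moral is to make every timed iteration produce a value that is actually used, and to repeat the measurement a few times so the early passes serve as warm-up. The sketch below is only an illustration under those assumptions: the class and method names are invented, and the JDK's own String.split stands in for the library under test. For serious measurements a dedicated harness such as JMH is the usual recommendation.

```java
import java.util.function.ToIntFunction;

class SplitBenchmarkSketch {
    // Runs the task repeatedly and returns the sum of its results,
    // so every result is consumed rather than thrown away.
    static long runPass(ToIntFunction<String> task, String input, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += task.applyAsInt(input);
        }
        return checksum;
    }

    // Stand-in task: split with the JDK and add up the token lengths.
    static int totalTokenLength(String input) {
        int total = 0;
        for (String token : input.split(",")) {
            total += token.length();
        }
        return total;
    }

    public static void main(String[] args) {
        String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
        for (int pass = 1; pass <= 5; pass++) { // early passes double as warm-up
            long start = System.nanoTime();
            long checksum = runPass(SplitBenchmarkSketch::totalTokenLength, numberList, 100_000);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("pass " + pass + ": " + elapsedMillis + " ms (checksum " + checksum + ")");
        }
    }
}
```

Printing the checksum alongside each timing gives a quick sanity check that every pass did the same amount of work.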
In conclusion I think I'll still use Splitter most of the time. On small lists the difference in performance is going to be negligible, and Splitter just feels much nicer to use. Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.

Comments
Der Meister replied on Wed, 2012/04/25 - 1:40am
But in most cases splitting strings is just a very minor part of an application (speed is not important here), so using
myString.trim().split("\\s*,\\s*")
is good enough - and no additional dependency!
Emmanuel Bourg replied on Wed, 2012/04/25 - 2:31am
Lance Semmens replied on Wed, 2012/04/25 - 6:00am
As you are alluding to, Guava does not do any actual work in the split() method. It defers the actual work to the Iterator, which splits the next token in its next() method (or more likely it does this in hasNext()). If you want to benchmark these two methods, you must iterate the results in both.
What might be more interesting than execution time is the memory footprint of each method. I'm guessing that Guava could happily split an enormous stream of characters whereas StringUtils would throw an OutOfMemoryError since it needs to store the entire array.
Pierre Laporte replied on Wed, 2012/04/25 - 7:46am
I do agree with the previous comments. The benchmarks you did are not completely reliable due to the JVM warm-up time and Guava's implementation of the split() method:
You should run your methods 10 times consecutively in a loop so that you see the speed improvements. On my machine, without any JVM option, I have the following results after 5 passes:
Anyway, thanks for this interesting article
matt inger replied on Wed, 2012/04/25 - 5:07pm
There's also commons-lang's StrTokenizer you could try as well (upfront honesty: I am one of the code contributors to this class)
StrTokenizer tokenizer = StrTokenizer.getCSVInstance();
tokenizer.reset(inputString);
while (tokenizer.hasNext()) {
    String token = tokenizer.next();
}
The StrTokenizer class implements ListIterator<String> (unfortunately not its more versatile cousin Iterable<String>). It's more object-oriented than StringUtils, but does parse the entire string when you ask for the first token, thus it does have storage requirements.
But it's reusable, and it can be configured for just about any realistic delimited text.
I'm curious how it would perform in your tests.
Srikanth Nair replied on Wed, 2012/04/25 - 11:50pm
Please ignore all this crap from google... Google is best as a search engine but not on opensource libraries like guava...
There is no point in using a slow library because of its readability.
A library must be efficient, straightforward and understandable.
// Apache StringUtils...
String[] tokens1 = StringUtils.split("one,two,three",',');
// Guava splitter...
Iterable<String> tokens2 = Splitter.on(',').split("one,two,three");
better than these two,
"some,crap".split(",") is enough.
What is the meaning of Splitter.on(',') ?
bullshit,
such an idiotic fluent usage will definitely confuse developers.