Jack of all trades software developer, specialising in Web and Mobile development. Have been programming professionally for over 10 years. These days I mainly do Java and Objective-C, but over the years have developed software in C++, Perl, and Javascript. Tom is a DZone MVB and is not an employee of DZone and has posted 10 posts at DZone. You can read more from them at their website.

Guava Splitter vs StringUtils

04.24.2012

So I recently wrote a post about good old reliable Apache Commons StringUtils, which provoked a couple of comments, one of which was that Google Guava provides better mechanisms for joining and splitting Strings. I have to admit, this is a corner of Guava I've yet to explore. So I thought I ought to take a closer look and compare it with StringUtils, and I was surprised at what I found.

Splitting strings eh? There can't be many different ways of doing this surely?

Well, Guava and StringUtils do take stylistically different approaches. Let's start with the basic usage.

// Apache StringUtils...
String[] tokens1 = StringUtils.split("one,two,three",',');
 
// Guava splitter...
Iterable<String> tokens2 = Splitter.on(',').split("one,two,three");

So, my first observation is that Splitter is more object oriented. You have to create a splitter object, which you then use to do the splitting, whereas the StringUtils split methods use a more functional style, with static methods.

Here I much prefer Splitter. Need a reusable splitter that splits comma separated lists? A splitter that also trims leading and trailing white space, and ignores empty elements? Not a problem:

Splitter niceCommaSplitter = Splitter.on(',')
                              .omitEmptyStrings()
                              .trimResults();
 
niceCommaSplitter.split("one,, two,  three"); //"one","two","three"
niceCommaSplitter.split("  four  ,  five  "); //"four","five"

That looks really useful, any other differences?

The other thing to notice is that Splitter returns an Iterable<String>, whereas StringUtils.split returns a String array.

Don't really see that making much of a difference, most of the time I just want to loop through the tokens in order anyway!

I also didn't think it was a big deal, until I examined the performance of the two approaches. To do this I tried running the following code:

final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
 
long start = System.currentTimeMillis(); 
for(int i=0; i<1000000; i++) {
    StringUtils.split(numberList , ',');  
}
System.out.println(System.currentTimeMillis() - start);
   
start = System.currentTimeMillis();
for(int i=0; i<1000000; i++) {
    Splitter.on(',').split(numberList );
}
System.out.println(System.currentTimeMillis() - start);

On my machine this output the following times:

594
31

Guava's Splitter is almost 20 times faster!

Now this is a much bigger difference than I was expecting: 594 ms against 31 ms makes Splitter roughly 19 times faster than StringUtils. How can this be? Well, I suspect it's something to do with the return type. Splitter returns an Iterable<String>, whereas StringUtils.split gives you an array of Strings! So Splitter doesn't actually need to create the new String objects up front.

It's also worth noting you can cache your Splitter object, which results in an even faster runtime.

Blimey, end of argument? Guava's Splitter wins every time?

Hold on a second. This isn't quite the full story. Notice we're not actually doing anything with the resulting Strings? Like I mentioned, it looks like Splitter isn't creating any new Strings when split() is called. I suspect it's deferring that work to the Iterator it returns.
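
To make that suspicion concrete, here's a minimal sketch of how a lazy splitter could behave. This is not Guava's actual code: LazySplitSketch and its substringsCreated counter are made-up names for illustration, and it uses only the JDK. The point is that split() returns straight away, and substrings only get created as the Iterator is walked.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only: this is NOT Guava's implementation.
final class LazySplitSketch {

    // Counts substrings created so far, to make the deferred work visible.
    static int substringsCreated = 0;

    // Returns immediately; no scanning or substring creation happens here.
    static Iterable<String> split(final String input, final char separator) {
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    private int position = 0;     // start of the next token
                    private boolean done = false; // true once the last token was returned

                    @Override
                    public boolean hasNext() {
                        return !done;
                    }

                    @Override
                    public String next() {
                        if (done) {
                            throw new NoSuchElementException();
                        }
                        int end = input.indexOf(separator, position);
                        if (end < 0) {
                            end = input.length();
                            done = true;
                        }
                        substringsCreated++;
                        String token = input.substring(position, end);
                        position = end + 1;
                        return token;
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

    public static void main(String[] args) {
        Iterable<String> tokens = split("One,Two,Three", ',');
        System.out.println(substringsCreated); // prints 0: nothing created yet

        List<String> collected = new ArrayList<String>();
        for (String token : tokens) {
            collected.add(token);
        }
        System.out.println(collected + " " + substringsCreated); // prints [One, Two, Three] 3
    }
}
```

If Splitter works anything like this, a loop that only calls split() is timing little more than object creation.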

So can we test this?

Sure thing. Here's some code to repeatedly check the lengths of the generated substrings:

final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
  final String[] numbers = StringUtils.split(numberList, ',');
  for (String number : numbers) {
    number.length();
  }
}
System.out.println(System.currentTimeMillis() - start);

Splitter splitter = Splitter.on(',');
start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
  Iterable<String> numbers = splitter.split(numberList);
  for (String number : numbers) {
    number.length();
  }
}
System.out.println(System.currentTimeMillis() - start);

On my machine this outputs:

609
2048

Guava's Splitter is over 3 times slower!

Indeed. I was expecting them to be about the same, or maybe Guava slightly faster, so this is another surprising result. It looks like, by returning an Iterable, Splitter is trading immediate gains for longer-term pain. There's also a moral here about making sure performance tests are actually testing something useful.
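
One way to act on that moral, sketched with the plain JDK: walk every token so that lazy implementations do their work inside the timed region, and run a few untimed warm-up passes first. BenchmarkSketch, Splitting, consumeAll and timeMillis are hypothetical names, and the indexOf-based splitter is only a stand-in so the sketch runs without extra libraries. You'd swap in StringUtils.split or a cached Splitter to compare the real thing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness for illustration; none of these names come from Guava or Commons Lang.
final class BenchmarkSketch {

    interface Splitting {
        Iterable<String> split(String input);
    }

    // Stand-in implementation so the sketch runs on a plain JDK.
    // Replace the body with Arrays.asList(StringUtils.split(input, ','))
    // or splitter.split(input) to measure the real libraries.
    static final Splitting INDEX_OF_SPLIT = new Splitting() {
        @Override
        public Iterable<String> split(String input) {
            List<String> tokens = new ArrayList<String>();
            int start = 0;
            int comma;
            while ((comma = input.indexOf(',', start)) >= 0) {
                tokens.add(input.substring(start, comma));
                start = comma + 1;
            }
            tokens.add(input.substring(start));
            return tokens;
        }
    };

    // Splits the input repeatedly and walks every token, so lazy
    // implementations have to do their work too. The summed length is
    // returned so the loop has a visible result.
    static long consumeAll(Splitting splitting, String input, int iterations) {
        long totalLength = 0;
        for (int i = 0; i < iterations; i++) {
            for (String token : splitting.split(input)) {
                totalLength += token.length();
            }
        }
        return totalLength;
    }

    // Runs untimed warm-up passes, then times one measured pass in milliseconds.
    static long timeMillis(Splitting splitting, String input, int iterations, int warmUpPasses) {
        for (int pass = 0; pass < warmUpPasses; pass++) {
            consumeAll(splitting, input, iterations);
        }
        long start = System.nanoTime();
        long checksum = consumeAll(splitting, input, iterations);
        long elapsedMillis = (System.nanoTime() - start) / 1000000L;
        if (checksum < 0) {
            throw new IllegalStateException("unexpected checksum"); // keeps the result in use
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
        System.out.println(timeMillis(INDEX_OF_SPLIT, numberList, 1000000, 5) + " ms");
    }
}
```

Timing each implementation in its own run, rather than one after the other in the same JVM, should also reduce the chance that execution order skews the numbers.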

In conclusion I think I'll still use Splitter most of the time. On small lists the difference in performance is going to be negligible, and Splitter just feels much nicer to use. Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.

Published at DZone with permission of Tom Jefferys, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Der Meister replied on Wed, 2012/04/25 - 1:40am

But in most cases splitting strings is just a very minor part of the software (speed is not important here), so using

myString.trim().split("\\s*,\\s*")

is good enough - and no additional dependency!
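
For reference, a quick JDK-only check of what that expression does with the article's sample inputs. RegexSplitSketch and splitCommaList are made-up names for illustration. It trims the tokens, but unlike a Splitter configured to omit empty strings it keeps empty elements:

```java
import java.util.Arrays;

// Hypothetical wrapper around the expression from the comment above.
final class RegexSplitSketch {

    // Trims the whole input, then splits on commas with optional surrounding whitespace.
    static String[] splitCommaList(String input) {
        return input.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitCommaList("  four  ,  five  "))); // prints [four, five]
        System.out.println(Arrays.toString(splitCommaList("one,, two,  three"))); // prints [one, , two, three]
    }
}
```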

Emmanuel Bourg replied on Wed, 2012/04/25 - 2:31am

If you do micro benchmarks you should test the two implementations separately, because the execution order can have an impact (typically the last test to run benefits from the "warm up" of the JVM caused by the previous test).

Lance Semmens replied on Wed, 2012/04/25 - 6:00am

As you are alluding to, Guava does not do any actual work in the split() method. It defers the actual work to the Iterator, which splits the next token in its next() method (or more likely it does this in hasNext()). If you want to benchmark these two methods, you must iterate the results in both.

What might be more interesting than execution time is the memory footprint of each method. I'm guessing that Guava could happily split an enormous stream of characters whereas StringUtils would throw an OutOfMemoryError since it needs to store the entire array.

Pierre Laporte replied on Wed, 2012/04/25 - 7:46am

I do agree with the previous comments. The benchmarks you did are not completely reliable due to the JVM warm-up time and Guava's implementation of the split() method:

  • StringUtils computes everything during the invocation of the split() method, while Guava computes each token during the invocation of iterator.next()
  • During the benchmark, the JVM may compile your classes into native code, so the next iterations will be faster

You should run your methods 10 times consecutively in a loop so that you see the speed improvements. On my machine, without any JVM option, I have the following results after 5 passes:

  • Pass #5 of StringUtils = ~500ms
  • Pass #5 of Splitter = ~650ms

 

Anyway, thanks for this interesting article.

matt inger replied on Wed, 2012/04/25 - 5:07pm

There's also commons-lang's StrTokenizer you could try as well (upfront honesty:  I am one of the code contributors to this class)

 

StrTokenizer tokenizer = StrTokenizer.getCSVInstance();
tokenizer.reset(inputString);
while (tokenizer.hasNext()) {
  String token = tokenizer.next();
}

The StrTokenizer class implements ListIterator<String> (unfortunately not its more versatile cousin Iterable<String>). It's more object oriented than StringUtils, but does parse the entire string when you ask for the first token, thus it does have storage requirements.

But it's reusable, and can be configured for just about any realistic delimited text.

I'm curious how it would perform in your tests. 

Srikanth Nair replied on Wed, 2012/04/25 - 11:50pm

Please ignore all this crap from google... Google is best as a search engine but not on open source libraries like guava...
There is no point in using a slow library because of its readability.
A library must be efficient, straightforward and understandable.


 // Apache StringUtils...
String[] tokens1 = StringUtils.split("one,two,three",',');


// Guava splitter...
Iterable<String> tokens2 = Splitter.on(',').split("one,two,three");

better than these two,

"some,crap".split(",") is enough.

What is the meaning of Splitter.on(',') ?
bullshit, 

such an idiotic fluent usage will definitely confuse developers.
