Cédric has posted 1 posts at DZone. View Full User Profile

Moving Lucene a step forward

03.28.2008
| 24669 views |
  • submit to reddit

6 ways of improving Lucene

6. No built-in support for clustering. If you want to create clusters, either write your own implementation of a Directory, or use Solr, or Nutch+Hadoop. Both Solr and Nutch leverage Lucene, but are not straight replacements. Lucene is embeddable, while you must leverage on Solr or Nutch. Well, I think it is not very surprising that Hadoop idea emerged from the Lucene team : Lucene doesn't scale out. It's internals makes it (very) fast for most common situations, but for large document sets, you have to scale out, and as Lucene does not implement clustering at the core level, you must switch from Lucene to another search engine layer, which is not straightforward. The problem with switching to Solr or Nutch is that you carry things you probably won't need : integrated crawling for Nutch and indexing server for Solr.

5. Span queries are slow. This may be interpreted as a problem specific to Lingway where we make intensive use of span queries (NEAR operator : "red NEAR car"), but the Lucene index structure has been updated to add this feature, which was not thought at first. The underlying implementation leads to complex algorithms that are very slow, especially when some term is repeated many times in a single large document. That's why I tend to say that Lucene is a high-performance text search engine only if you use basic boolean queries.

4. Scoring is not really pluggable. Lucene has its own implementation of a scoring algorithm, where terms may be boosted, and makes use of a Similarity class, but soon shows limitations when you want to perform complex scoring, for example based on actual matches and meta data about the query. If you want to do so, you'll have to extend the Lucene query classes. The facts is that Lucene has been thought in terms of tf/idf like scoring algorithms, while in our situation, for linguistic based scoring, the structure of Lucene scoring facilities don't fit. We've been obliged to override every Lucene query class in order to add support for our custom scoring. And that was a problem :

3. Lucene is not well designed. As a software architect, I would tend to make this reason 1. Lucene has a very poor OO design. Basically, you have packages, classes, but almost no design pattern usage. I always makes me think about an application written by C(++) developers who discover Java and carry bad practices among them. This is a serious point : whenever you have to customize Lucene to suit your needs (and you will have to do so), you'll have to face the problem. Here are some examples :

  • Almost no use of interfaces. Query classes (for example BooleanQuery, SpanQuery, TermQuery...) are all subclasses of an abstract class. If you want to add a feature to those classes, you'll basically want to write an interface which describes the contract for your extension, but as the abstract Query class does not implement an interface, you'll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls. Tons of examples like this (HitCollector, ...). This is also a problem when you wan't to use AOP and auto-proxying.
  • Unnatural iterator implementations. No hasNext() method, next() returns a boolean and refreshes the object context. This is a pain when you want to keep track of iterated elements. I assume this have been done intentionally to reduce memory footprint but it makes once again algorithms both unclear and complex.

2. A closed API which makes extending Lucene a pain. In Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature. This leads to an API where most classes are package protected, which means you won't ever be able to extend it (unless you create your class in the same package, which is quite dirty for custom code) or you'll have to copy and rewrite the code. Moreover, and it is quite related to the previous point, there's a serious lack of OO design here too : some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behaviour. The reason invoked for closing the API is that the code has to be cleaned up and made stable before releasing it publicly. While the idea is honourable, once again it is a pain because if you have some code that does not fit in the mainstream Lucene idea, you'll have to constantly backport Lucene improvements to your version until your patch is accepted. However, as the developers want to limit API changes as long as possible, there's little chance that your patch will ever be commited. Add some final modifiers on either classes or methods and you'll face the problem. I don't think the Spring framework would have come so popular if the code had been so locked...

1. Lucene search algorithms are not adapted to grid computing. Lucene has been written when hardware did not have memory that much and multicore processors didn't exist. Therefore, the index structure has been thought and implemented in order to perform fast linear searches with a very little memory footprint. I've personally spent lots of hours trying to rewrite span query algorithms so that I could make use of a multithreaded context (for dual/quad core cpus), but the iterator-based directory reading algorithms make it hardly impossible to do so. In some rare cases you'll be able to optimize things and iterate the index in parallel, but in most situations it will be impossible. In our case, when we have a very complex query with 50+ embedded span queries, the CPU is almost not used while the system is constantly calling I/Os, even using a RAMDirectory.

References
Published at DZone with permission of its author, Cédric CHAMPEAU. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags:

Comments

Ronald Miura replied on Fri, 2008/03/28 - 11:31am

Promoting your product by bashing competitors isn't very nice...

By the way, I think Lucene is 'less-than-you'd-want' extensible by design. They choose to make most things 'final' to be able to change the internals without worrying about 'unintended' extension points that people could be using (I've read about it somewhere...). Well, I've used Lucene only for pretty 'standard' things (no cluster, no grid, no fancy custom scoring), and never had problems these limitations.

About interfaces, they aren't always better than (abstract) classes. They are good for extension, but sometimes bad for evolution. Erich Gamma states this in an interview at Artima: "In fact, an abstract class gives you more flexibility when it comes to evolution. You can add new behavior without breaking clients."

One can argue that Lucene's design is 'C++-ish', but well, C++ programmers do care more about performance and resource usage than Java ones, but it's not a bad thing :)

And, of course, it has one great advantage: it's price is unbeatable :)

Sebastian Otaegui replied on Fri, 2008/03/28 - 12:12pm

Oh you french people... :P

Paul Michael Bauer replied on Fri, 2008/03/28 - 12:13pm in response to: Ronald Miura

(1) I don't think he is bashing Lucene.  He is merely pointing out its limitations for a particular case with which he is familiar. 

 and

(2) Lucene and Lingway are not competitors nor are they intended to be

Harry Tuttle replied on Mon, 2008/03/31 - 4:24am

Hmmm. Some points:

6) No built-in support for clustering.
The Lucene library cannot be criticised for missing features it specifically sees as out-of-scope.

5) Span queries are slow.
Automatically turning the phrase "red car" into "[red/burgundy/cherry] near [car/vehicle/automobile]" *will* incur extra cost in *any* search engine. Where is the Lucene design inefficient? Have you considered joining terms into phrases at index time rather than query time? Alternative terms can be posted at the same position in a Lucene index. While you may see this automated query expansion as "the future of search" others may not be quite so convinced - but that's another debate :)

4) Scoring is not really pluggable
You can tweak scoring for coordination factors, field length normalisation, term frequency, inverse document frequency, word proximity, document boosts and query clause boosts. These are the common factors that people need to tweak. Your scoring factors sound uncommon (but you were able to add them anyway, just not as easily). If you think your scoring factors should be made easy to tweak, make a submission and the community will be happy to incorporate them. It's the community that makes/shapes the product through contributions.


5) Lucene is not well designed.
Almost no use of interfaces.
This is done deliberately and for well-considered reasons. Product evolution is significantly easier when using abstract classes in place of interfaces.
Unnatural iterator implementations. No hasNext() method, next() returns a boolean
When in a tight loop that is accessed millions of times for each query there is a good reason not to introduce superfluous method calls. Nice OO abstractions can add unnecessary overhead in a performance-critical application.

6) . A closed API which makes extending Lucene a pain. In Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature.

Lucene has deliberately marked certain packages as protected or private to indicate which APIs are accepted extension points and which aren't.
This is perfectly reasonable. You obviously have a problem with certain APIs not being public. Have you tried to convince the community to open them up? Again, if the idea is sensible then I'm sure the community would not reject it. You *always* have the option of changing the desired Lucene classes to make them public - it is entirely open source and does not prohibit this kind of modification, even in commercial applications such as yours. The use of public/private just makes it obvious when you are building in areas the Lucene community reserves the right to change freely.

 

 

 

Bruce Ritchie replied on Sat, 2008/04/05 - 6:49pm

6. No builtin support for clustering.

I agree with a previous comment - it's out of scope for what lucene is designed to accomplish. If you want clustering, perhaps carrot2 (http://www.carrot2.org/) is something that you might be interested in.


 

Ted Dunning replied on Mon, 2008/09/08 - 11:28am

 

Regarding clustering, there are two meanings for clustering.  Carrot2 groups documents together.  The original author was talking about scaling the size of the search engine farm.

 In fact, lucene does support scaling much better than it first appears.  Lucene itself defined large scale search as out of scope, but the scoring algorithms and the index format are both designed to allow both horizontal and vertical scaling.  The scoring algorithms allow results from different collections to be merged and the index format supports very fast merging, both of which could have made scaling much more difficult.

 My own experience with Lucene is much more positive than the authors.  I have built very large systems using Lucene, including semantic search systems and have been able to scale it to very large work loads.  Frankly, I find Lucene a much more reasonable storage substrate than mySQL for most web applications.  We couldn't have built Veoh without Lucene and that definitely required scaling the search farm!

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.