
Moving Lucene a step forward

03.28.2008

Any Alternative?

I think the last point is the most problematic: Lucene reaches its limits when it comes to searching large datasets (with or without many operators) on modern hardware. That's why I've been looking for an alternative to Lucene. After reading blog entries and a discussion about Wikia, I found that there are not many alternatives. However, I finally came across a very promising solution: MG4J. It has a very good object design, excellent search performance (though indexing is slower than with Lucene), a small memory footprint, is up to 10x faster than Lucene on my span query benchmarks, and is natively designed for clustering. It also has built-in support for payloads, which in Lucene are a very recent and still experimental addition. However, MG4J still lacks some features, such as easy incremental indexing (indices ARE clusters, but the performance implications are unclear), document removal and an easier-to-use indexing process. What made me happy is that I was able to reproduce in a few hours the customizations that had taken me days with Lucene.

 

I think there's room for a new open source search engine which is not designed around a single computer indexing a collection of documents with limited memory, but around transparent distributed indexing and searching, in order to provide fast answers on large datasets (think of Terracotta or GridGain as distribution frameworks): you should not have to add an application layer to gain access to clustering features. Lucene is an excellent implementation of the first category of search engines, but I think this is not suited to what we require now: being able to find the best answer to a question in a reasonable amount of time. tf/idf-based search algorithms and Google rank are not the future of search engines. Finding the most relevant answers implies complex queries involving document metadata and semantics, which is basically what Lingway does (with Lucene and other backing search engines), but it requires more power and an underlying technology which supports modern hardware.

A good reason to choose Lucene

Whatever reproaches I may have about Lucene, it is still the best Java open source solution available for what we are doing ;-)

References
Published at DZone with permission of its author, Cédric CHAMPEAU. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Comments

Ronald Miura replied on Fri, 2008/03/28 - 11:31am

Promoting your product by bashing competitors isn't very nice...

By the way, I think Lucene is 'less-than-you'd-want' extensible by design. They chose to make most things 'final' to be able to change the internals without worrying about 'unintended' extension points that people could be using (I've read about it somewhere...). Well, I've used Lucene only for pretty 'standard' things (no cluster, no grid, no fancy custom scoring), and never had problems with these limitations.

About interfaces, they aren't always better than (abstract) classes. They are good for extension, but sometimes bad for evolution. Erich Gamma states this in an interview at Artima: "In fact, an abstract class gives you more flexibility when it comes to evolution. You can add new behavior without breaking clients."
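To illustrate the evolution argument, here is a minimal sketch in plain Java. It does not use Lucene; `ResultSink`, `ListSink`, `offer` and `totalHits` are hypothetical names invented for this example. The idea is that a library can add a concrete method to an abstract class in a later release, and subclasses written against the earlier release keep compiling and working. (Since Java 8, default methods on interfaces narrow this gap, but that option did not exist when this discussion took place.)

```java
import java.util.ArrayList;
import java.util.List;

public class EvolutionSketch {
    // Hypothetical library type. "Version 1" shipped only collect(int);
    // "version 2" adds totalHits() with a concrete body.
    static abstract class ResultSink {
        private int total = 0;

        // The only method clients had to implement in version 1.
        public abstract void collect(int doc);

        // Called by the (hypothetical) search loop for every hit.
        public final void offer(int doc) {
            total++;
            collect(doc);
        }

        // Added in version 2: existing subclasses inherit it unchanged.
        public int totalHits() {
            return total;
        }
    }

    // Client code written against version 1; it knows nothing about totalHits().
    static class ListSink extends ResultSink {
        final List<Integer> docs = new ArrayList<>();

        @Override
        public void collect(int doc) {
            docs.add(doc);
        }
    }

    public static void main(String[] args) {
        ListSink sink = new ListSink();
        sink.offer(3);
        sink.offer(7);
        System.out.println(sink.docs + " total=" + sink.totalHits());
    }
}
```

Had `ResultSink` been a pre-Java-8 interface, adding `totalHits()` would have forced every existing implementation to change before it could compile again.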

One can argue that Lucene's design is 'C++-ish'; well, C++ programmers do care more about performance and resource usage than Java ones, and that's not a bad thing :)

And, of course, it has one great advantage: its price is unbeatable :)

Sebastian Otaegui replied on Fri, 2008/03/28 - 12:12pm

Oh you French people... :P

Paul Michael Bauer replied on Fri, 2008/03/28 - 12:13pm in response to: Ronald Miura

(1) I don't think he is bashing Lucene.  He is merely pointing out its limitations for a particular case with which he is familiar. 

 and

(2) Lucene and Lingway are not competitors, nor are they intended to be.

Harry Tuttle replied on Mon, 2008/03/31 - 4:24am

Hmmm. Some points:

6) No built-in support for clustering.
The Lucene library cannot be criticised for missing features it specifically sees as out-of-scope.

5) Span queries are slow.
Automatically turning the phrase "red car" into "[red/burgundy/cherry] near [car/vehicle/automobile]" *will* incur extra cost in *any* search engine. Where is the Lucene design inefficient? Have you considered joining terms into phrases at index time rather than query time? Alternative terms can be posted at the same position in a Lucene index. While you may see this automated query expansion as "the future of search" others may not be quite so convinced - but that's another debate :)
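The index-time alternative mentioned above can be sketched with a toy positional index. This is plain Java, not Lucene code, and every name in it (`SynonymIndexSketch`, `addVariant`, `matchesPhrase`) is an assumption made for illustration. In Lucene itself the usual way to post an alternative term at the same position is to emit it from the analyzer with a position increment of zero. The sketch posts a canonical term alongside each variant, so the query "red car" stays a plain two-term phrase instead of an expanded span query.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SynonymIndexSketch {
    // term -> docId -> positions at which the term occurs
    private final Map<String, Map<Integer, Set<Integer>>> postings = new HashMap<>();
    // variant -> canonical term (e.g. "burgundy" -> "red")
    private final Map<String, String> canonicalForm = new HashMap<>();

    void addVariant(String variant, String canonical) {
        canonicalForm.put(variant, canonical);
    }

    // Index a document; each variant also posts its canonical term
    // at the SAME position, so no expansion is needed at query time.
    void index(int docId, String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            post(tokens[pos], docId, pos);
            String canonical = canonicalForm.get(tokens[pos]);
            if (canonical != null) {
                post(canonical, docId, pos);
            }
        }
    }

    private void post(String term, int docId, int pos) {
        postings.computeIfAbsent(term, t -> new HashMap<>())
                .computeIfAbsent(docId, d -> new TreeSet<>())
                .add(pos);
    }

    private Set<Integer> positions(String term, int docId) {
        Map<Integer, Set<Integer>> byDoc = postings.get(term);
        if (byDoc == null) return Collections.<Integer>emptySet();
        Set<Integer> found = byDoc.get(docId);
        return found == null ? Collections.<Integer>emptySet() : found;
    }

    // Exact two-term phrase: 'second' must occur right after 'first'.
    boolean matchesPhrase(int docId, String first, String second) {
        Set<Integer> after = positions(second, docId);
        for (int p : positions(first, docId)) {
            if (after.contains(p + 1)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SynonymIndexSketch idx = new SynonymIndexSketch();
        idx.addVariant("burgundy", "red");
        idx.addVariant("automobile", "car");
        idx.index(1, "a burgundy automobile");
        idx.index(2, "a red bicycle");
        System.out.println(idx.matchesPhrase(1, "red", "car"));
        System.out.println(idx.matchesPhrase(2, "red", "car"));
    }
}
```

The trade-off is a larger index and a synonym list that is frozen at indexing time, in exchange for cheaper queries.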

4) Scoring is not really pluggable
You can tweak scoring for coordination factors, field length normalisation, term frequency, inverse document frequency, word proximity, document boosts and query clause boosts. These are the common factors that people need to tweak. Your scoring factors sound uncommon (but you were able to add them anyway, just not as easily). If you think your scoring factors should be made easy to tweak, make a submission and the community will be happy to incorporate them. It's the community that makes/shapes the product through contributions.


5) Lucene is not well designed.
Almost no use of interfaces.
This is done deliberately and for well-considered reasons. Product evolution is significantly easier when using abstract classes in place of interfaces.
Unnatural iterator implementations. No hasNext() method, next() returns a boolean
When in a tight loop that is accessed millions of times for each query there is a good reason not to introduce superfluous method calls. Nice OO abstractions can add unnecessary overhead in a performance-critical application.
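As a rough illustration of the iteration style being defended, here is a minimal sketch of a cursor whose `next()` both advances and reports whether an element is available. `DocCursor` and `ArrayDocCursor` are hypothetical types written for this example, not Lucene classes. Compared with `java.util.Iterator<Integer>`, the loop makes one advancing call per element instead of a `hasNext()`/`next()` pair, and it reads a primitive `int` rather than a boxed `Integer`.

```java
public class DocCursorSketch {
    // Hypothetical cursor: next() moves forward and returns false at the end;
    // doc() reads the current element without moving.
    interface DocCursor {
        boolean next();
        int doc();
    }

    static final class ArrayDocCursor implements DocCursor {
        private final int[] docs;
        private int index = -1;

        ArrayDocCursor(int[] docs) {
            this.docs = docs;
        }

        @Override
        public boolean next() {
            if (index < docs.length) {
                index++;
            }
            return index < docs.length;
        }

        @Override
        public int doc() {
            return docs[index];
        }
    }

    // Typical consumer: a tight loop with no boxing and no separate hasNext().
    static long sumDocs(DocCursor cursor) {
        long sum = 0;
        while (cursor.next()) {
            sum += cursor.doc();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumDocs(new ArrayDocCursor(new int[] {2, 5, 9})));
    }
}
```

Whether the saving matters depends on the JIT and the workload, so it is worth measuring rather than assuming.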

6) A closed API which makes extending Lucene a pain. In the Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature.

Lucene has deliberately marked certain packages as protected or private to indicate which APIs are accepted extension points and which aren't.
This is perfectly reasonable. You obviously have a problem with certain APIs not being public. Have you tried to convince the community to open them up? Again, if the idea is sensible then I'm sure the community would not reject it. You *always* have the option of changing the desired Lucene classes to make them public - it is entirely open source and does not prohibit this kind of modification, even in commercial applications such as yours. The use of public/private just makes it obvious when you are building in areas the Lucene community reserves the right to change freely.

 

 

 

Bruce Ritchie replied on Sat, 2008/04/05 - 6:49pm

6. No builtin support for clustering.

I agree with a previous comment - it's out of scope for what Lucene is designed to accomplish. If you want clustering, perhaps Carrot2 (http://www.carrot2.org/) is something that you might be interested in.


 

Ted Dunning replied on Mon, 2008/09/08 - 11:28am

 

Regarding clustering, there are two meanings for clustering.  Carrot2 groups documents together.  The original author was talking about scaling the size of the search engine farm.

In fact, Lucene supports scaling much better than it first appears. Lucene itself defined large-scale search as out of scope, but the scoring algorithms and the index format are both designed to allow horizontal and vertical scaling. The scoring algorithms allow results from different collections to be merged, and the index format supports very fast merging; had either been designed differently, scaling would have been much more difficult.
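To make the result-merging point concrete, here is a minimal sketch of combining per-shard result lists into a global top-k. It is plain Java rather than Lucene code, and `Hit`, `mergeTopK` and the shard names are hypothetical. It assumes each shard returns its hits sorted by descending score and that scores are comparable across shards, which in practice depends on how term statistics are shared between the collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMergeSketch {
    static final class Hit {
        final String shard;
        final int doc;
        final float score;

        Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // k-way merge: each inner list must already be sorted by descending score.
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        // Queue entries are {shardIndex, offset}; the best current head comes first.
        PriorityQueue<int[]> heads = new PriorityQueue<>((a, b) ->
                Float.compare(perShard.get(b[0]).get(b[1]).score,
                              perShard.get(a[0]).get(a[1]).score));
        for (int s = 0; s < perShard.size(); s++) {
            if (!perShard.get(s).isEmpty()) {
                heads.add(new int[] {s, 0});
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < k && !heads.isEmpty()) {
            int[] head = heads.poll();
            List<Hit> hits = perShard.get(head[0]);
            merged.add(hits.get(head[1]));
            // Replace the consumed head with the next hit from the same shard.
            if (head[1] + 1 < hits.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> a = Arrays.asList(new Hit("A", 1, 0.9f), new Hit("A", 2, 0.4f));
        List<Hit> b = Arrays.asList(new Hit("B", 7, 0.7f), new Hit("B", 8, 0.1f));
        for (Hit h : mergeTopK(Arrays.asList(a, b), 3)) {
            System.out.println(h.shard + ":" + h.doc + " " + h.score);
        }
    }
}
```

Each shard only needs to return its own top k hits, so the merging node handles at most k entries per shard regardless of index size.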

My own experience with Lucene is much more positive than the author's. I have built very large systems using Lucene, including semantic search systems, and have been able to scale it to very large workloads. Frankly, I find Lucene a much more reasonable storage substrate than MySQL for most web applications. We couldn't have built Veoh without Lucene and that definitely required scaling the search farm!
