
Moving Lucene a step forward

03.28.2008

At Lingway, we've been using Lucene for a few years now. For those who are new to Lucene, here is its tagline: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Before criticizing, I must admit that Lucene is a very good, high-performance full-text search engine. For years, Lucene has been considered a first-class citizen among embeddable search engines written in Java. Its reputation has grown fast, and it is still the best open source Java search engine available. There is nothing to add on that point: Doug Cutting has done a great job. However, its development has been very slow in recent months, and I think Lucene will most likely not keep up with today's document processing needs. Make no mistake: I am not a search engine developer. I am a developer who uses search engines to provide high-relevance information retrieval technologies.

This post is about why Lucene may not be the best choice for future developments if nothing is done, and why the situation is unlikely to change soon. At Lingway, we push Lucene to its limits, although we make it work quite well. That is one reason why we made some suggestions and submitted a patch to Lucene (which does not cover everything listed here). Lingway uses semantics to generate complex queries where proximity matters. For example, if you are looking for documents on conflicts in the Middle East, you will probably also want to find documents about the war in Iraq. In that case, war and Iraq are called expansions of conflict and middle east respectively. We provide a technology that analyzes your query to deduce the most relevant expansions and generates queries for them. Yet this is not enough to get relevant results: Google-like ranking, or term frequency scoring as implemented in Lucene, does not suit semantic scoring needs. For example, a document that contains both the terms middle and east, but separated by more than one word, is most likely not what you want to find. Moreover, expansions should receive lower scores than the words of the original query. For example, we will give a better score to the phrase conflict in middle east than to war in Iraq.
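
To make the two scoring requirements above concrete, here is a minimal sketch in plain Java. It is neither Lingway's engine nor Lucene code: the class name, the hard-coded query terms, and the 0.5 expansion weight are assumptions chosen purely for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy scorer: illustrative only, neither Lucene nor Lingway code. */
public class ExpansionScoringSketch {

    /** Assumed penalty for matching an expansion instead of the original word. */
    public static final double EXPANSION_WEIGHT = 0.5;

    /** True when `second` follows `first` with at most maxWordsBetween words in between. */
    public static boolean near(String[] tokens, String first, String second, int maxWordsBetween) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            int limit = Math.min(tokens.length - 1, i + 1 + maxWordsBetween);
            for (int j = i + 1; j <= limit; j++) {
                if (tokens[j].equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Scores a text against the hard-coded query "conflict in middle east". */
    public static double score(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        List<String> words = Arrays.asList(tokens);
        double score = 0.0;

        // Concept 1: the original word "conflict" beats its expansion "war".
        if (words.contains("conflict")) {
            score += 1.0;
        } else if (words.contains("war")) {
            score += EXPANSION_WEIGHT;
        }

        // Concept 2: "middle" and "east" only count when at most one word
        // separates them; otherwise fall back to the expansion "iraq".
        if (near(tokens, "middle", "east", 1)) {
            score += 1.0;
        } else if (words.contains("iraq")) {
            score += EXPANSION_WEIGHT;
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Conflict in Middle East")); // prints 2.0
        System.out.println(score("War in Iraq"));             // prints 1.0
    }
}
```

A text in which middle and east are far apart gets no credit for the second concept, which is the behavior plain term frequency scoring does not give you.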

At Lingway, we think this kind of document retrieval technology is the future of search engines. Google is good at finding documents, but what we want is to find the most relevant ones. However, most (if not all) current search engines were not designed to run such complex queries. Lucene is used by Wikipedia, and you will notice that if you search for more than a single word, most results are irrelevant.

Here is a screenshot of the upcoming Lingway KM 3.7 interface, which demonstrates these requirements. Here, we write a query in French, which is used to find documents in English about the same subject. Note that this is more than plain translation; we call it cross-language mode:

Note the matches in green: chanteur becomes singer, but we also find matches about singing. The same goes for pop, which expands to blues... Now for the technical part:

Published at DZone with permission of its author, Cédric CHAMPEAU. (source)

Comments

Ronald Miura replied on Fri, 2008/03/28 - 11:31am

Promoting your product by bashing competitors isn't very nice...

By the way, I think Lucene is 'less-than-you'd-want' extensible by design. They chose to make most things 'final' to be able to change the internals without worrying about 'unintended' extension points that people could be using (I've read about it somewhere...). Well, I've used Lucene only for pretty 'standard' things (no cluster, no grid, no fancy custom scoring), and never had problems with these limitations.

About interfaces, they aren't always better than (abstract) classes. They are good for extension, but sometimes bad for evolution. Erich Gamma states this in an interview at Artima: "In fact, an abstract class gives you more flexibility when it comes to evolution. You can add new behavior without breaking clients."
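
As a rough illustration of the evolution argument, the sketch below adds a method to an abstract class after client code has already been written against it. The `Scorer` and `ConstantScorer` types are hypothetical and are not Lucene classes.

```java
/** Illustrative only: a hypothetical library type, not a Lucene class. */
public class EvolutionSketch {

    /** "Version 2" of the library's abstract class. */
    public abstract static class Scorer {
        public abstract float score(int doc);

        // Added in version 2. Because it has a body, subclasses written
        // against version 1 keep compiling and simply inherit it.
        public String explain(int doc) {
            return "score=" + score(doc);
        }
    }

    /** Client code written against version 1: it knows nothing about explain(). */
    public static class ConstantScorer extends Scorer {
        @Override
        public float score(int doc) {
            return 1.0f;
        }
    }

    public static void main(String[] args) {
        System.out.println(new ConstantScorer().explain(7)); // prints score=1.0
    }
}
```

Had `Scorer` been a plain interface at the time, adding `explain()` would have broken every existing implementor. Since Java 8, default methods on interfaces narrow this gap.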

One can argue that Lucene's design is 'C++-ish'. Well, C++ programmers do care more about performance and resource usage than Java ones, and that's not a bad thing :)

And, of course, it has one great advantage: its price is unbeatable :)

Sebastian Otaegui replied on Fri, 2008/03/28 - 12:12pm

Oh you French people... :P

Paul Michael Bauer replied on Fri, 2008/03/28 - 12:13pm in response to: Ronald Miura

(1) I don't think he is bashing Lucene. He is merely pointing out its limitations for a particular case with which he is familiar.

(2) Lucene and Lingway are not competitors, nor are they intended to be.

Harry Tuttle replied on Mon, 2008/03/31 - 4:24am

Hmmm. Some points:

6) No built-in support for clustering.
The Lucene library cannot be criticised for missing features it specifically sees as out-of-scope.

5) Span queries are slow.
Automatically turning the phrase "red car" into "[red/burgundy/cherry] near [car/vehicle/automobile]" *will* incur extra cost in *any* search engine. Where is the Lucene design inefficient? Have you considered joining terms into phrases at index time rather than query time? Alternative terms can be posted at the same position in a Lucene index. While you may see this automated query expansion as "the future of search" others may not be quite so convinced - but that's another debate :)
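
The index-time alternative can be sketched with a toy positional index in plain Java. This is not Lucene code, and the class name and synonym map are assumptions for illustration; in Lucene itself the same effect is obtained by emitting the alternative token with a position increment of zero.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy single-document positional index; illustrative only, not Lucene code. */
public class SamePositionIndexSketch {

    private final Map<String, Set<Integer>> positionsByTerm = new HashMap<>();
    private final Map<String, List<String>> alternatives;

    /** @param alternatives for each document word, the extra terms to post at the same position */
    public SamePositionIndexSketch(Map<String, List<String>> alternatives) {
        this.alternatives = alternatives;
    }

    private void post(String term, int position) {
        positionsByTerm.computeIfAbsent(term, t -> new HashSet<>()).add(position);
    }

    /** Indexes the text, posting each word and its alternatives at the same position. */
    public void index(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        for (int position = 0; position < tokens.length; position++) {
            post(tokens[position], position);
            for (String alt : alternatives.getOrDefault(tokens[position], Collections.emptyList())) {
                post(alt, position); // same position: the expansion is paid for once, at index time
            }
        }
    }

    /** True when `second` occurs at the position immediately after `first`. */
    public boolean matchesPhrase(String first, String second) {
        Set<Integer> secondPositions = positionsByTerm.getOrDefault(second, Collections.emptySet());
        for (int position : positionsByTerm.getOrDefault(first, Collections.emptySet())) {
            if (secondPositions.contains(position + 1)) {
                return true;
            }
        }
        return false;
    }
}
```

With `burgundy` mapped to `red` and `automobile` mapped to `car`, a document containing "burgundy automobile" matches the plain phrase query `red car`, so no expanded proximity query is needed at search time. The trade-off is a larger index and a synonym set that is fixed when the document is indexed.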

4) Scoring is not really pluggable
You can tweak scoring for coordination factors, field length normalisation, term frequency, inverse document frequency, word proximity, document boosts and query clause boosts. These are the common factors that people need to tweak. Your scoring factors sound uncommon (but you were able to add them anyway, just not as easily). If you think your scoring factors should be made easy to tweak, make a submission and the community will be happy to incorporate them. It's the community that makes/shapes the product through contributions.


5) Lucene is not well designed.
"Almost no use of interfaces."
This is done deliberately and for well-considered reasons. Product evolution is significantly easier when using abstract classes in place of interfaces.
"Unnatural iterator implementations. No hasNext() method, next() returns a boolean."
In a tight loop that runs millions of times for each query, there is good reason not to introduce superfluous method calls. Nice OO abstractions can add unnecessary overhead in a performance-critical application.
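
For readers unfamiliar with this iteration style, here is a minimal sketch of a cursor whose `next()` both advances and reports whether a current element exists. The class is hypothetical and only mimics the shape described above; it is not the actual Lucene class.

```java
/** Illustrative cursor over document ids; not the actual Lucene class. */
public class DocCursorSketch {

    private final int[] docs;
    private int index = -1;

    public DocCursorSketch(int[] docs) {
        this.docs = docs;
    }

    /** Advances the cursor and reports whether a current document exists: one call per step. */
    public boolean next() {
        return ++index < docs.length;
    }

    /** Current document id; only valid after next() has returned true. */
    public int doc() {
        return docs[index];
    }

    /** Typical consumer loop: no hasNext() call and no boxing of the int ids. */
    public static long sum(int[] docs) {
        DocCursorSketch cursor = new DocCursorSketch(docs);
        long total = 0;
        while (cursor.next()) {
            total += cursor.doc();
        }
        return total;
    }
}
```

A `java.util.Iterator<Integer>` would need a `hasNext()` and a `next()` call per element and would box each id, which is the kind of overhead the reply refers to.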

6) A closed API which makes extending Lucene a pain. In Lucene world, it is called a feature. The policy is to open classes when some user needs to gain access to some feature.

Lucene has deliberately marked certain classes and members as protected or private to indicate which APIs are accepted extension points and which aren't.
This is perfectly reasonable. You obviously have a problem with certain APIs not being public. Have you tried to convince the community to open them up? Again, if the idea is sensible then I'm sure the community would not reject it. You *always* have the option of changing the desired Lucene classes to make them public: it is entirely open source and does not prohibit this kind of modification, even in commercial applications such as yours. The use of public/private just makes it obvious when you are building in areas the Lucene community reserves the right to change freely.


Bruce Ritchie replied on Sat, 2008/04/05 - 6:49pm

6. No builtin support for clustering.

I agree with a previous comment: it is out of scope for what Lucene is designed to accomplish. If you want clustering, perhaps Carrot2 (http://www.carrot2.org/) is something you might be interested in.


Ted Dunning replied on Mon, 2008/09/08 - 11:28am

Regarding clustering, the word has two meanings. Carrot2 groups documents together. The original author was talking about scaling the size of the search engine farm.

In fact, Lucene supports scaling much better than it first appears. Lucene itself defined large-scale search as out of scope, but the scoring algorithms and the index format are both designed to allow horizontal as well as vertical scaling. The scoring algorithms allow results from different collections to be merged, and the index format supports very fast merging. Had either been designed differently, scaling would have been much more difficult.
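
As a sketch of what merging results from different collections means in practice, the plain Java below combines per-shard hit lists into one global top-k. It assumes, as the paragraph above states, that scores from different collections are comparable; the `Hit` type and `mergeTopK` method are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative merge of per-shard results; not Lucene code. */
public class ShardMergeSketch {

    /** One search hit, tagged with the shard (collection) it came from. */
    public static final class Hit {
        public final String shard;
        public final int doc;
        public final float score;

        public Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Returns the k best hits across all shards, highest score first. */
    public static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Valid only under the assumption that scores are comparable across shards.
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Each shard only has to return its own top k hits, so the merge step stays cheap however large the individual indexes are.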

My own experience with Lucene is much more positive than the author's. I have built very large systems using Lucene, including semantic search systems, and have been able to scale it to very large workloads. Frankly, I find Lucene a much more reasonable storage substrate than MySQL for most web applications. We couldn't have built Veoh without Lucene, and that definitely required scaling the search farm!
