Moving Lucene a step forward
At Lingway, we've been using Lucene for a few years now. For those who are new to Lucene, here is its one-line summary: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Before criticizing, I must admit that Lucene is a very good, high-performance full-text search engine. For years, Lucene has been considered a first-class citizen among embeddable search engines written in Java. Its reputation has grown fast, and it is still the best open source Java search engine available. There's nothing to add to that: Doug Cutting has done a great job. However, its development has been very slow in recent months, and I think Lucene will most likely not keep up with today's document processing needs. Don't get me wrong: I am not a search engine developer, I am a developer who leverages search engines in order to provide high-relevance information retrieval technologies.
This post is about why Lucene may not be the best choice for future developments if nothing is done, and why the situation is unlikely to change soon. We push Lucene to its limits, although we do make it work quite well. That is one reason why we made some suggestions and submitted a patch to Lucene (which does not cover everything listed here).
Lingway uses semantics to generate complex queries where proximity matters. For example, if you are looking for documents on conflicts in middle east, you'll probably also want to find documents talking about war in Iraq. In that case, war and Iraq are called expansions of conflict and middle east respectively. We provide a technology which analyzes your query in order to deduce the most relevant expansions, and generates queries for them. Yet this alone is not enough to get relevant results: Google-like ranking, or term frequency scoring as implemented in Lucene, does not suit semantic scoring needs. For example, a document which contains both the terms middle and east, but separated by more than one word, is most likely not what you want to find. Moreover, expansions should receive lower scores than the words of the original query. For example, we'll give a better score to the phrase conflict in middle east than to war in Iraq.
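To make these two requirements concrete, here is a minimal sketch in plain Java. It is not Lingway's implementation and it does not use the Lucene API: the class name ProximityScorer, the weights (1.0 for original terms, 0.5 for expansions), the 1/(1 + gap) decay and the hard-coded query are all assumptions chosen for illustration. It only shows the idea that a pair of terms should count when the words are close enough, and that expansions should weigh less than the original words.

```java
import java.util.Arrays;
import java.util.List;

final class ProximityScorer {

    /** Lowercases the text and splits it on non-word characters. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    /**
     * Smallest number of words found between an occurrence of a and a later
     * occurrence of b, or -1 if b never follows a.
     */
    static int minGap(List<String> tokens, String a, String b) {
        int best = -1;
        int lastA = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(a)) {
                lastA = i;
            } else if (t.equals(b) && lastA >= 0) {
                int gap = i - lastA - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        return best;
    }

    /** Scores a single term: the given weight if present, 0 otherwise. */
    static double term(List<String> tokens, String word, double weight) {
        return tokens.contains(word) ? weight : 0.0;
    }

    /** Scores an ordered pair: 0 if the words are further apart than maxGap. */
    static double pair(List<String> tokens, String a, String b, int maxGap, double weight) {
        int gap = minGap(tokens, a, b);
        if (gap < 0 || gap > maxGap) {
            return 0.0;
        }
        return weight / (1 + gap); // closer words score higher (assumed decay)
    }

    /** Hypothetical query: "conflict", "middle east", plus expansions "war", "iraq". */
    static double score(List<String> tokens) {
        double original = term(tokens, "conflict", 1.0)
                + pair(tokens, "middle", "east", 1, 1.0);
        double expansions = term(tokens, "war", 0.5)
                + term(tokens, "iraq", 0.5); // expansions weigh less (assumed 0.5)
        return original + expansions;
    }

    public static void main(String[] args) {
        System.out.println(score(tokenize("The conflict in the Middle East escalated"))); // 2.0
        System.out.println(score(tokenize("War in Iraq")));                               // 1.0
        System.out.println(score(tokenize("Trouble in the middle of the far east")));     // 0.0
    }
}
```

The third document contains both middle and east, but three words apart, so it scores nothing for that pair, while the document matching only the expansions ranks below the one matching the original terms.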
At Lingway, we think this kind of document retrieval technology is the future of search engines. Google is good at finding documents, but what we want is to find the most relevant ones. However, most (if not all) current search engines were not designed to perform such complex queries... Lucene is used by Wikipedia, and you'll notice that if you search for more than a single word, most results are irrelevant...
Here's a screenshot of the upcoming Lingway KM 3.7 interface, which demonstrates the requirements. Here, we write a query in French, which is used to find documents in English talking about the same subject. Note that this is more than plain translation; we call it cross language mode:
Note the matches in green: chanteur becomes singer, but we also find matches about singing. The same goes for pop, which expands to blues... Now for the technical part:
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)