Moving Lucene a step forward
I think the last point is the most problematic: Lucene reaches its limits when it comes to searching large datasets (with or without many operators) on modern hardware. That's why I've been looking for an alternative to Lucene. After reading blog entries and a discussion about Wikia, I found that there were not many alternatives. However, I finally came across a very promising solution: MG4J. It has a very good object design, excellent search performance (although indexing is slower than with Lucene) and a small memory footprint, it is up to 10x faster than Lucene on my span query benchmarks, and it is natively designed for clustering. It also has built-in support for payloads, which in Lucene are a very recent and still experimental addition. However, MG4J still lacks some features, such as easy incremental indexing (indices are clusters, but I have no idea how this affects performance), document removal and an easier-to-use indexing process. What made me happy is that I was able to reproduce in a few hours the customizations that had taken me days to make on Lucene.
I think there's room for a new open source search engine that is not designed around a single computer indexing a collection of documents with limited memory, but around transparent distributed indexing and searching, in order to provide fast answers on large datasets (think of Terracotta or GridGain as distribution frameworks): you should not have to rely on an additional application layer to gain access to clustering features. Lucene is an excellent implementation of the first category of search engine, but I think it is not suited to what we require now: being able to find the best answer to a question in a reasonable amount of time. tf/idf-based search algorithms and Google rank are not the future of search engines. Finding the most relevant answers implies complex queries involving document metadata and semantics, which is basically what Lingway does (with Lucene and other backing search engines), but it requires more power and an underlying technology that supports modern hardware.
A good reason to choose Lucene
Whatever criticisms I have of Lucene, it is still the best open source Java solution available for what we are doing ;-)
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)