Lucene & Solr Year 2011 in Review
We should start by pointing out that this year Apache Lucene celebrated its 10-year anniversary as an Apache Software Foundation project. Lucene itself is actually more than 10 years old. Otis is one of the very few people from the early years who is still around. While we didn't celebrate any Solr anniversaries this year, we should note that Solr, too, has been around for quite a while and is in fact approaching its 6th year at the ASF!
This year saw numerous changes and additions in both Lucene and Solr. As a matter of fact, we'd venture to say we saw more changes in Lucene & Solr this year than in any single year before. In that sense, both projects are very much like wine - getting better with time. Let's take a look at a few of the most significant changes in 2011.
The much-anticipated Near Real-Time search (NRT) functionality has arrived. What this means is that documents that were just added to a Lucene/Solr index can be made visible in search results almost immediately. This is big! Work on NRT continues, but it is ready for use, and you, like a number of our clients, should start using it.
Field Collapsing was one of the most watched and voted-for JIRA issues for many months. This functionality was implemented this year, and Lucene and Solr users can now group results on the basis of a field or a query. In addition, you can control the groups and even calculate facets on the groups rather than on single documents. A rather powerful feature.
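To make this concrete on the Solr side, here is a minimal sketch of a grouped request. It only builds the request URL with the Python standard library. The `group`, `group.field` and `group.limit` parameters are Solr's result grouping parameters; the host, handler path, query and the `manufacturer` field name are assumptions made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

def grouping_params(query, field, per_group=3):
    """Build Solr request parameters that group results by one field."""
    return {
        "q": query,
        "group": "true",           # switch result grouping on
        "group.field": field,      # hits sharing this field value form one group
        "group.limit": per_group,  # documents returned per group
        "wt": "json",
    }

# Hypothetical host, handler path and field name; adjust to your own setup.
url = "http://localhost:8983/solr/select?" + urlencode(
    grouping_params("laptop", "manufacturer")
)
print(url)
```

Instead of a flat list of documents, the response to such a request contains one entry per distinct field value, each holding up to `group.limit` documents.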
From a Lucene user's perspective, it is also worth noting that Lucene finally got a faceting module. Until now, faceting was available only in Solr. If you are a pure Lucene user, you no longer need Solr to calculate facets.
In the past, modeling parent-child relationships in Lucene and Solr indices was not really possible - one had to flatten everything. No longer - if you need to model a parent-child relationship in your index, you can use the Join contrib module. This Join functionality lets you join parent and child documents at query time, while relying on some assumptions about how the documents were indexed.
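The indexing assumption is, as far as we understand it, that each parent's children are indexed as one contiguous block with the parent document coming last. The sketch below is plain Python, not the Lucene module itself, and the function name and data layout are made up for illustration. It shows why that layout makes the join cheap: the parent of any matching child is simply the next document flagged as a parent.

```python
def parents_of_matching_children(is_parent, matching_children):
    """Map matching child doc IDs to the doc IDs of their parents.

    is_parent         -- list of booleans indexed by doc ID; every block is
                         assumed to be its children followed by the parent
    matching_children -- doc IDs of child documents that matched the query
    """
    parents = []
    for child in sorted(matching_children):
        doc = child
        # Walk forward to the end of the block, where the parent sits.
        while not is_parent[doc]:
            doc += 1
        # Children of the same block resolve to the same parent; keep it once.
        if not parents or parents[-1] != doc:
            parents.append(doc)
    return parents

# Two blocks: docs 0-1 are children of doc 2, doc 3 is a child of doc 4.
is_parent = [False, False, True, False, True]
print(parents_of_matching_children(is_parent, [0, 1, 3]))
```

No lookup table from child to parent is needed at query time, which is the payoff for the restriction on how documents are indexed.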
Good and broad language support is hugely important for any search solution, and this year was good for Lucene and Solr in that department: the KStemFilter English stemmer was added, full Unicode 4 support was added, new Japanese and Chinese support was added, a new stemmer-protection mechanism was added, work was done on reducing the synonym filter's RAM consumption, etc. Another big addition was integration with Hunspell, which enables language-specific processing for all languages supported by Open Office. That's a lot of new languages we can now handle with Lucene and Solr! There is more.
Lucene 3.5.0 significantly reduced the memory footprint of the term dictionary. Big! Lucene now uses 3 to 5 times less memory when dealing with the term dictionary, so it consumes even less RAM.
If you use Lucene and need to page through a lot of results, you may run into problems. That's why Lucene 3.5.0 introduced the searchAfter method, which solves the deep paging problem once and for all!
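The idea behind searchAfter can be sketched in a few lines. The code below is plain Python, not Lucene code, and its function name, arguments and ranking rule are assumptions chosen for illustration. Rather than collecting the top offset + n hits to reach a deep page, it keeps only the n best hits that rank strictly after the last hit of the previous page.

```python
import heapq

def search_after(hits, after=None, n=10):
    """Return the top n (score, doc_id) hits ranked strictly after `after`.

    hits  -- iterable of (score, doc_id) pairs in any order
    after -- last (score, doc_id) of the previous page, or None for page one
    Ranking: higher score first, ties broken by lower doc_id.
    """
    def rank(hit):
        score, doc_id = hit
        return (-score, doc_id)

    if after is None:
        candidates = hits
    else:
        candidates = (h for h in hits if rank(h) > rank(after))
    # At most n entries are held at once, however deep the page is.
    return heapq.nsmallest(n, candidates, key=rank)

hits = [(1.0, 0), (3.0, 1), (2.0, 2), (2.0, 3), (0.5, 4)]
page1 = search_after(hits, n=2)
page2 = search_after(hits, after=page1[-1], n=2)
print(page1, page2)
```

The caller only has to remember the last hit of the page it already has, so the memory needed per request depends on the page size and not on how far into the result set the user has paged.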
There is also a new, fast and reliable Term Vector-based highlighter that both Lucene and Solr can use.
Dismax is great, but the Extended Dismax query parser added to Solr is even better - it extends the Dismax query parser's functionality and can further improve the quality of search results.
You can now also sort by function (imagine sorting the results by distance from a point), and there is a new spatial search with filtering.
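As a rough illustration, the sketch below builds the parameters for a Solr request that filters hits to a radius around a point and sorts them by distance. The `sfield`, `pt` and `d` parameters, the `{!geofilt}` filter and the `geodist()` function come from Solr's spatial search; the helper name, the coordinates and the `store` field name are assumptions made up for this example.

```python
from urllib.parse import urlencode, parse_qs

def nearby_params(query, lat, lon, radius_km, location_field):
    """Build Solr parameters: filter to a radius, then sort by distance."""
    return {
        "q": query,
        "sfield": location_field,  # field holding each document's location
        "pt": f"{lat},{lon}",      # the point distances are measured from
        "d": radius_km,            # filter radius in kilometers
        "fq": "{!geofilt}",        # keep only documents within d of pt
        "sort": "geodist() asc",   # sort by function: nearest first
    }

# Hypothetical coordinates and field name.
params = nearby_params("coffee", 45.15, -93.85, 5, "store")
query_string = urlencode(params)
print(query_string)
```

Because `sfield` and `pt` are given once as top-level parameters, both the filter and the sort function can pick them up without repeating the values.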
Solr also got new suggest/autocomplete functionality based on an FST (finite state transducer), which significantly reduced the memory needed for such functionality. If you need this for your search application, have a look at Sematext's AutoComplete - it has additional functionality that lots of our customers like.
While not yet officially released, the new transaction log support provides Solr with a real-time get operation - as soon as you add a document you can retrieve it by ID. This will also be used for recovering nodes in SolrCloud.
And talking about SolrCloud... We've covered SolrCloud on this blog before in Solr Digest, Spring-Summer 2011, Part 2: Solr Cloud and Near Real Time Search, and we'll be covering it again soon. In short, SolrCloud will make it easier for people to operate larger Solr clusters by making use of more modern design principles and software components such as ZooKeeper, which make it easier to build distributed, cluster-based software and services. Some of the core functionality is that there will be no single point of failure, any node will be able to handle any operation, there will be no traditional master-slave setup, there will be centralized cluster management and configuration, failovers will be automatic, and in general things will be much more dynamic. SolrCloud has not been released yet, but Solr developers are working on it and the codebase is seeing good progress. We've used SolrCloud in a few recent engagements with our customers and were pleased by what we saw.
After the development of the two projects was merged back in 2010, we saw a speed-up in development and releases. Lucene and Solr committers introduced five(!) new versions of both projects! In March, Lucene and Solr 3.1 were released with Unicode 4 support, ReusableTokenStream, Spatial search, the Vector-based Highlighter, the Extended Dismax parser, and many more features and bug fixes. Then, after less than 3 months(!), on June 4th, version 3.2 was released. This release introduced a new and much desired results grouping module, NRTCachingDirectory, and highlighting performance improvements. Just one month later, on July 1st, Lucene and Solr 3.3 were introduced. That release included the KStem stemmer, new Spellchecker implementations, Field Collapsing in Solr, and reduced RAM usage for the autocomplete mechanism. By the end of summer there was another release: version 3.4, released on the 14th of September. Pure Lucene users got what Solr had been able to do for a very long time - the long-awaited faceting module contributed by IBM. Version 3.4 also included the new Join functionality, the ability to turn off query and filter caches, and faceting calculation for Field Collapsing. The last release of Lucene and Solr saw the light of day in late November. Version 3.5.0 brought a huge memory reduction when dealing with term dictionaries, deep paging support, the SearcherManager and SearcherLifetimeManager classes, language identification provided by Tika, and sortMissingFirst and sortMissingLast support for TrieFields.
During the last 12 months we attended three major conferences focused on search and big data themes. Lucene Revolution took place in San Francisco in May. Otis gave a talk titled "Search Analytics: What? Why? How?" (slides) on the first day. There were a number of other good talks there, and the complete conference agenda is available at http://lucenerevolution.com/2011/agenda. Some videos are available as well. Next came the Berlin Buzzwords conference, a more grass-roots conference, which took place between the 4th and 10th of June. Otis gave an updated version of his "Search Analytics: What? Why? How?" talk. If you want to know more, check the conference's official site - http://berlinbuzzwords.de. The last conference, focused exclusively on Lucene and Solr, was Lucene Eurocon 2011, held in sunny and tourist-filled Barcelona between the 17th and 20th of October. And guess what - we were there again (surprise!), this time in slightly larger numbers. Otis gave a talk about "Search Analytics: Business Value & BigData NoSQL Backend" (video, slides) and Rafał gave a talk on a pretty popular topic - "Explaining & Visualizing Solr 'explain' information" (video, slides).
No open source project can endure without regular injections of new blood. This year, the Lucene and Solr development team was joined by a number of new people whose names may look familiar to you:
These 7 men are now Lucene and Solr committers and we look forward to our next year's Year in Review post, where we hope to go over the good things these people will have brought to Lucene and Solr in 2012.
You know an open source project is successful when a whole book is dedicated to it. You know a project is very successful when more than one book and more than one publisher cover it. There were no new editions of Lucene in Action (amazon, manning) this year, but our own Rafał Kuć published his Solr 3.1 Cookbook (amazon) in July. Rafał's cookbook includes a number of recipes that can make your life easier when it comes to solving common problems with Apache Solr. Another book, Apache Solr 3 Enterprise Search Server (amazon) by David Smiley and Eric Pugh was published in November. This is a major update to the first edition of the book and it covers a wide range of functionalities of Apache Solr.
This article was originally published on Sematext Blog by @sematext