Rafał Kuć is a team leader and software developer, currently working as a software architect and Solr and Lucene specialist. He focuses mainly on Java but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB, is not an employee of DZone, and has posted 75 posts at DZone.

Rich Documents Processing: On the Search or Application Side

06.18.2012

When indexing so-called “rich documents”, we should think about where those documents will be processed: should we send them to Apache Solr (or another search engine, such as ElasticSearch) and forget about them, or should we run Apache Tika ourselves first and send the extracted content, along with other information, for indexing?

Options

As mentioned above, we have two options. The first is to send the binaries to the search engine and, in Solr’s case, use the ExtractingRequestHandler (information about integrating Solr with Apache Tika can be found here) so that it does all the work for us. The second is to use almost the same functionality to parse the binary documents and extract their contents ourselves before sending them to Solr. Of course there is a third option, not possible in most cases: get the documents you want to index in a format Solr already understands :)

Processing on the Search Server Side

The simplest approach is to process your “rich documents” on the search server side. Let’s assume it’s Apache Solr. We configure the ExtractingRequestHandler the way we want it to work and forget about everything else. But it’s not always the right approach. Imagine that your indexing server is already almost 100% utilized: adding another source of load, such as content extraction, would probably cause performance problems. In such cases you will probably want to do it the other way.
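
To make the server-side option more concrete, here is a minimal Python sketch of what the application has to do: wrap the raw bytes in a single POST and let Solr do the parsing. It assumes the ExtractingRequestHandler is mapped to the conventional /update/extract path; the host, the core name "docs", the document id and the placeholder bytes are all hypothetical, and the sketch only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, payload,
                          content_type="application/octet-stream"):
    """Build (but do not send) a POST that hands a raw binary to Solr.

    Assumes the ExtractingRequestHandler is registered at /update/extract.
    """
    # literal.id sets the document id; commit=true makes it visible at once.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        url=f"{solr_url}/update/extract?{params}",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Hypothetical host, core name, id and file content, for illustration only.
request = build_extract_request(
    "http://localhost:8983/solr/docs",
    "report-1",
    b"placeholder bytes standing in for a PDF file",
    "application/pdf",
)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it. Note that the application does no parsing at all here, which is exactly why the extraction cost lands on the indexing server.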

Processing Outside of the Search Server

If the number of rich documents is huge, or your indexing server is almost completely utilized, then it may be a good idea to process your binary files before sending them to the indexing server. Using Apache Tika, for example, we can quite easily build a good, reliable solution for processing rich documents in our application. Of course, such an approach requires some knowledge of Java (or whichever language you use for content extraction). It can save us from a situation where the indexing server is overloaded and, because of the amount of data, we can’t do anything about it.
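
The shape of the application-side option can be sketched in a few lines of Python. This is not Apache Tika: a small HTML text collector from the standard library stands in for the extraction step purely so the example is self-contained, and a real system would call Tika (or a similar parser) at that point. The field names "content" and "author" and the sample document are hypothetical, and the JSON list-of-documents body is an assumption about how the extracted text would be posted to Solr’s update handler.

```python
import json
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(raw_bytes):
    """Stand-in for the Tika step: turn a binary document into plain text."""
    collector = _TextCollector()
    collector.feed(raw_bytes.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.chunks)


def build_update_payload(doc_id, raw_bytes, extra_fields=None):
    """Extract on the application side, then build a JSON update body.

    The "content" field name is hypothetical; it must match your schema.
    """
    doc = {"id": doc_id, "content": extract_text(raw_bytes)}
    doc.update(extra_fields or {})
    return json.dumps([doc])


# Hypothetical document and metadata, for illustration only.
payload = build_update_payload(
    "page-1",
    b"<html><body><h1>Quarterly report</h1><p>Revenue grew.</p></body></html>",
    {"author": "example"},
)
print(payload)
```

The search server now receives only plain text and metadata, so the CPU cost of parsing is paid by the application (or a separate extraction machine) instead of the indexing server.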

A Few Words at the End

Once every few weeks we will publish posts that don’t cover a particular Apache Solr feature, but instead discuss a general search problem or describe the architecture of a system that has search as one of its parts. We hope such posts will let us, and you, look at search topics from a wider perspective than Apache Solr alone.

Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Rafał Kuć replied on Mon, 2012/06/18 - 4:29pm

@DZone guys - pay attention to the authors - Kelvin Tan is not an author of the original post here.

Will Soprano replied on Wed, 2012/06/20 - 9:03am

Sorry for the mixup, I fixed the error
