Rafał Kuć is a team leader and software developer, currently working as a software architect and Solr and Lucene specialist. He focuses mainly on Java but is open to any tool or programming language that helps him reach his goal more easily and quickly. Rafał is also one of the founders of the solr.pl site, where he shares his knowledge and helps people with their problems. Rafał is a DZone MVB, is not an employee of DZone, and has posted 75 posts at DZone.

Solr 3.1+: JSON Update Handler

10.14.2011

After the release of Solr 3.1 I decided to look into the extended list of formats we can use to update indexes. Until now we had a choice of three formats for providing data: XML, CSV, and the so-called JavaBin. The release of Solr 3.1 introduces a fourth format: JSON.

Let’s start

The new handler (JsonUpdateRequestHandler) lets us send data in JSON format, which in theory should mean less data sent over the network and faster indexing, since the JSON parser is expected to be faster than XML parsers. But let’s leave performance for later.

Configuration

Let’s start by defining a handler. To do that, add the following definition to the solrconfig.xml file (if you use the default solrconfig.xml file provided with Solr 3.1, then this handler is already defined):

<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" startup="lazy" />

The entry above defines a new handler that will be initialized when it is used for the first time (startup="lazy").
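To check whether a given solrconfig.xml already defines the handler, one option is to parse the file and look for the entry by name. The following is only a minimal sketch: SAMPLE_CONFIG is a hypothetical, heavily shortened stand-in for a real configuration file, and find_handler is a helper name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; a real solrconfig.xml holds many more elements.
SAMPLE_CONFIG = """<config>
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" startup="lazy" />
</config>"""


def find_handler(config_xml, name):
    """Return the attributes of the requestHandler registered under `name`, or None."""
    root = ET.fromstring(config_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("name") == name:
            return dict(handler.attrib)
    return None


# Prints the attributes of the JSON handler entry if it is present.
print(find_handler(SAMPLE_CONFIG, "/update/json"))
```

For a real installation the contents of solrconfig.xml would be read from disk and passed in place of SAMPLE_CONFIG.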

Indexing

The next step is to prepare the data – of course in JSON format. Here’s an example showing two documents in one file called data.json:

{
"add": {
  "doc": {
    "id" : "123456788",
    "region" : ["abc","def"],
    "name" : "ABCDEF"
  }
},
"add": {
  "doc": {
    "id" : "123456789",
    "region" : ["abc","def"],
    "name" : "XYZMN"
  }
}
}
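Note that the "add" key appears once per document, which an ordinary dictionary or map cannot represent. When generating such a file from a program, one way around this is to serialize each entry separately and join the pieces by hand. The sketch below assumes Python; build_update_body is a hypothetical helper name, not part of Solr.

```python
import json


def build_update_body(docs):
    """Serialize documents as one JSON object with a repeated "add" key."""
    # A Python dict cannot hold the same key twice, so every "add" entry is
    # serialized on its own and the entries are joined manually.
    entries = ['"add": ' + json.dumps({"doc": doc}) for doc in docs]
    return "{\n" + ",\n".join(entries) + "\n}"


docs = [
    {"id": "123456788", "region": ["abc", "def"], "name": "ABCDEF"},
    {"id": "123456789", "region": ["abc", "def"], "name": "XYZMN"},
]

print(build_update_body(docs))
```

Writing the returned string to data.json gives a file with the same structure as the example above.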

A file prepared this way can be sent to the /update/json address and thus be indexed. Remember to send a commit command to the appropriate address (the standard /update) to tell Solr to open a new index searcher.
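The two steps (posting the file, then committing) could be scripted roughly as follows. This is a sketch under assumptions: the base URL http://localhost:8983/solr is the address of the example Solr instance and may differ in your setup, the function names are made up for illustration, and the commit is sent as an XML <commit/> command to /update.

```python
import urllib.request

# Assumption: Solr runs locally on the example port; adjust to your setup.
SOLR_URL = "http://localhost:8983/solr"


def json_update_request(body, solr_url=SOLR_URL):
    """Build a POST request carrying a JSON update body for /update/json."""
    return urllib.request.Request(
        solr_url + "/update/json",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def commit_request(solr_url=SOLR_URL):
    """Build a POST request sending an XML commit command to /update."""
    return urllib.request.Request(
        solr_url + "/update",
        data=b"<commit/>",
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


def index_file(path, solr_url=SOLR_URL):
    """Send the file to Solr and then commit; requires a running Solr instance."""
    with open(path, encoding="utf-8") as handle:
        body = handle.read()
    for request in (json_update_request(body, solr_url), commit_request(solr_url)):
        with urllib.request.urlopen(request) as response:
            response.read()
```

With Solr running, calling index_file("data.json") would index both documents and make them searchable.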

Performance

I left for last what interests me most: the performance of the new handler. According to the information in JIRA, we can expect JsonUpdateRequestHandler to be faster than its XML counterpart. To check this, I prepared files of 10,000, 100,000 and 1 million documents. Every document contained an identifier (string field), two regions (string field, multivalued) and a name (text field). For each size, one file was saved in JSON format, a second in XML format and a third in CSV format. All files were then indexed separately. Here is the outcome of this simple test:

| Number of documents | JSON size | XML size | CSV size | XML indexing time | JSON indexing time | CSV indexing time |
|---|---|---|---|---|---|---|
| 10,000 | 1.19 MB | 1.88 MB | 394 KB | 954 ms | 734 ms | 732 ms |
| 100,000 | 12.4 MB | 19.3 MB | 4.33 MB | 7656 ms | 6222 ms | 6203 ms |
| 1,000,000 | 129 MB | 197 MB | 48.1 MB | 126625 ms | 119023 ms | 108234 ms |

The conclusions suggest themselves. First, the XML files are considerably larger than their JSON counterparts: the JSON files are about 35% smaller. The JSON files are in turn larger than the CSV ones, which might be expected. If you send data outside the local network, size matters: the difference is significant enough that it is worth considering a switch from XML to one of the more compact formats.
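The size differences can be recomputed from the table above. The sketch below copies the file sizes from the table; the only assumption is that 394 KB is written as 0.394 MB (1 MB taken as 1000 KB), and the helper names are made up for illustration.

```python
# File sizes in MB, copied from the table above.
# Assumption: 394 KB is expressed as 0.394 MB (1 MB = 1000 KB).
SIZES_MB = {
    10_000: {"xml": 1.88, "json": 1.19, "csv": 0.394},
    100_000: {"xml": 19.3, "json": 12.4, "csv": 4.33},
    1_000_000: {"xml": 197.0, "json": 129.0, "csv": 48.1},
}


def percent_smaller(smaller, larger):
    """Return by how many percent `smaller` is below `larger`."""
    return (larger - smaller) / larger * 100


def size_savings(sizes=SIZES_MB):
    """Map each document count to the savings of JSON vs XML and CSV vs JSON."""
    return {
        count: {
            "json_vs_xml": round(percent_smaller(row["json"], row["xml"]), 1),
            "csv_vs_json": round(percent_smaller(row["csv"], row["json"]), 1),
        }
        for count, row in sizes.items()
    }


for count, savings in size_savings().items():
    print(count, savings)
```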

Indexing time

Another thing is the indexing time. Judging by this simple test, JsonUpdateRequestHandler is faster than XmlUpdateRequestHandler, though the margin shrinks as the files grow: roughly 23% and 19% for the two smaller files, and about 6% for the file with 1 million documents. CSVRequestHandler is practically level with the JSON handler on the two smaller files (well under 1% difference) and about 9% faster on the largest one. Let’s hope that when the noggit library comes out of Apache Labs, its performance will be even better, and thus we will see an even faster JsonUpdateRequestHandler.
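For readers who want to check the timing comparison themselves, here is a minimal sketch that derives relative differences and throughput (documents per second) from the timings in the table. The numbers are copied from the table; the function names are hypothetical.

```python
# Indexing times in milliseconds, copied from the table above.
TIMES_MS = {
    10_000: {"xml": 954, "json": 734, "csv": 732},
    100_000: {"xml": 7656, "json": 6222, "csv": 6203},
    1_000_000: {"xml": 126625, "json": 119023, "csv": 108234},
}


def percent_faster(fast_ms, slow_ms):
    """Return by how many percent `fast_ms` is shorter than `slow_ms`."""
    return (slow_ms - fast_ms) / slow_ms * 100


def docs_per_second(count, elapsed_ms):
    """Return indexing throughput in documents per second."""
    return count / (elapsed_ms / 1000)


for count, row in TIMES_MS.items():
    json_vs_xml = percent_faster(row["json"], row["xml"])
    csv_vs_json = percent_faster(row["csv"], row["json"])
    throughput = docs_per_second(count, row["json"])
    print(f"{count}: JSON vs XML {json_vs_xml:.1f}%, "
          f"CSV vs JSON {csv_vs_json:.1f}%, JSON {throughput:.0f} docs/s")
```

Looking at throughput as well as percentages helps show how the gap between formats changes with file size.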

 

Published at DZone with permission of Rafał Kuć, author and DZone MVB. (source)


Comments

Robert Craft replied on Thu, 2012/01/26 - 5:56am

Solr 3.1 Release Highlights
* Numeric range facets (similar to date faceting).
* New spatial search, including spatial filtering, boosting and sorting capabilities.
* Example Velocity driven search UI at http://localhost:8983/solr/browse
* A new termvector-based highlighter
* Extended dismax (edismax) query parser which addresses some missing features in the dismax query parser along with some extensions.
* Several more components now support distributed mode: TermsComponent, SpellCheckComponent.
* A new Auto Suggest component.
* Ability to sort by functions.
* JSON document indexing
* CSV response format
* Apache UIMA integration for metadata extraction
* Leverages Lucene 3.1 and its inherent optimizations and bug fixes as well as new analysis capabilities.
* Numerous improvements, bug fixes, and optimizations.
