
Jetslide uses ElasticSearch as Database

07.13.2011

This post explains how one could use the search server ElasticSearch as a database. I’m using ElasticSearch as my only data storage system because, for Jetslide, I want to avoid the maintenance and development overhead that a separate system would require, be it a NoSQL, object or plain SQL database.

ElasticSearch is a really powerful search server based on Apache Lucene. So why can you use ElasticSearch as a single point of truth (SPOT)? Let us go through all, or at least my, requirements of a data storage system! Did I forget something? Add a comment :) !

CRUD & Search

You can create, read (see also realtime get), update and delete documents of different types. And of course you can perform full text search!

Multi tenancy

Multiple indices are very easy to create and to delete. This can be used to support several clients or simply to put different types into different indices like one would do when creating multiple tables for every type/class.

Sharding and Replication

Sharding and replication are just a matter of numbers when creating the index:

curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2'

You can even update the number of replicas afterwards ‘on the fly’. To update the number of shards of one index you have to reindex (see the reindexing section below).
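Why the shard count cannot simply be changed can be seen from a rough model of document routing: each document is assigned to a shard by hashing its id modulo the number of primary shards, so a different shard count sends most documents to a different shard. Replicas are only copies of a shard, which is why their number can change freely. The sketch below is just an illustration: `shard_for` and `moved_documents` are hypothetical helpers, and CRC32 is a stand-in hash, not the function ElasticSearch actually uses.

```python
import zlib


def shard_for(doc_id: str, number_of_shards: int) -> int:
    """Toy routing rule: stable hash of the id, modulo the shard count.

    CRC32 is an assumption used for illustration only.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def moved_documents(doc_ids, old_shards: int, new_shards: int):
    """Return the ids whose shard would change if only the shard count changed."""
    return [d for d in doc_ids
            if shard_for(d, old_shards) != shard_for(d, new_shards)]


if __name__ == "__main__":
    ids = [f"tweet-{i}" for i in range(1000)]
    moved = moved_documents(ids, old_shards=3, new_shards=4)
    print(f"{len(moved)} of {len(ids)} documents would live on a different shard")
```

Because existing documents would no longer be found on the shard the routing rule points to, the data has to be written again into an index created with the new shard count.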

Distributed & Cloud

ElasticSearch can be distributed over a lot of machines. You can dynamically add and remove nodes (video). Additionally read this blog post for information about using ElasticSearch in ‘the cloud’.

Fault tolerant & Reliability

ElasticSearch will recover from the last snapshot of its gateway if something ‘bad’ happens, like an index corruption or even a total cluster fallout (think time machine for search). Watch this video from Berlin Buzzwords (minute 26) to understand how the ‘reliable and asynchronous nature’ are combined in ElasticSearch.

Nevertheless, I still recommend making a backup from time to time to a different system (or at least a different hard disk), e.g. in case you hit ElasticSearch or Lucene bugs, or simply to be on the safe side :)

Realtime Get

When using Lucene you have a real time latency, which basically means that if you store a document in the index you’ll have to wait a bit until it appears in search results. Although this latency is quite small (only a few milliseconds), it is there and it gets bigger as the index gets bigger. But ElasticSearch implements a realtime get feature in its latest version, which makes it possible to retrieve a document by its id even if it is not yet searchable!
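The difference between "searchable" and "retrievable by id" can be illustrated with a small in-memory toy model. This is only a sketch of the behaviour described above, not of how ElasticSearch implements it; `ToyIndex` and all of its methods are hypothetical names.

```python
class ToyIndex:
    """Toy model of near-real-time search; not ElasticSearch's implementation."""

    def __init__(self):
        self._pending = {}     # written, but not yet visible to search
        self._searchable = {}  # visible to search after a refresh

    def index(self, doc_id, doc):
        """Store a document; it is not searchable until refresh() is called."""
        self._pending[doc_id] = doc

    def refresh(self):
        """Make all pending documents visible to search."""
        self._searchable.update(self._pending)
        self._pending.clear()

    def search(self, term):
        """Return the sorted ids of searchable documents whose text contains term."""
        return sorted(doc_id for doc_id, doc in self._searchable.items()
                      if term in doc["text"])

    def get(self, doc_id):
        """Realtime get: look at pending writes first, then at searchable ones."""
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._searchable.get(doc_id)


if __name__ == "__main__":
    idx = ToyIndex()
    idx.index("1", {"text": "hello world"})
    print(idx.search("hello"))  # the document is not searchable yet
    print(idx.get("1"))         # but it can already be fetched by id
    idx.refresh()
    print(idx.search("hello"))  # now it shows up in search
```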

Refresh, Commit and Versioning

As I said, you have a realtime latency when creating or updating (aka indexing) a document. To update a document you can use the realtime get, merge it and put it back in the index. Another approach, which avoids further hits on ElasticSearch, would be to call refresh (or commit in Solr) on the index. But this is very problematic (e.g. slow) when the index is not tiny.

The good news is that you can again solve this problem with a feature of ElasticSearch called versioning. It is identical to the ‘application-side’ optimistic locking known from the database world: put the document in the index and, if that fails, e.g. merge the old state with the new one and try again. To be honest, this requires a bit more thinking (a failure queue or similar), but now I have a really well-working system secured with unit tests.

If you think about it, this is a really huge benefit over e.g. Solr. Even if Solr’s raw indexing is faster (no one has really done a good job of comparing the indexing performance of Solr vs. ES), it requires a call to commit to make the documents searchable, which slows down the whole indexing process a lot compared to ElasticSearch, where you never really need to call the expensive refresh.
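The versioning approach can be sketched as a retry loop around a versioned write. The example below uses an in-memory stand-in for the index so that it runs on its own; `ToyStore`, `VersionConflict` and `update_with_retry` are hypothetical names, and the merge step is a plain dictionary merge, which is an assumption rather than what Jetslide actually does.

```python
class VersionConflict(Exception):
    """Raised when a write is based on an outdated version of a document."""


class ToyStore:
    """In-memory stand-in for a versioned document store (illustration only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, doc)

    def get(self, doc_id):
        """Return (version, doc); version 0 and None mean 'does not exist'."""
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, doc, expected_version):
        """Write only if the stored version still equals expected_version."""
        current_version, _ = self._docs.get(doc_id, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{doc_id}: expected v{expected_version}, found v{current_version}")
        self._docs[doc_id] = (current_version + 1, doc)
        return current_version + 1


def update_with_retry(store, doc_id, changes, max_attempts=5):
    """Optimistic locking: read, merge, write; start over on a conflict."""
    for _ in range(max_attempts):
        version, doc = store.get(doc_id)
        merged = {**(doc or {}), **changes}  # assumed merge strategy
        try:
            return store.put(doc_id, merged, expected_version=version)
        except VersionConflict:
            continue  # somebody else wrote in between; re-read and retry
    raise VersionConflict(f"gave up on {doc_id} after {max_attempts} attempts")


if __name__ == "__main__":
    store = ToyStore()
    store.put("user1", {"name": "peter"}, expected_version=0)
    new_version = update_with_retry(store, "user1", {"lang": "java"})
    print(new_version, store.get("user1"))
```

In a real setup, updates that still fail after the last attempt would go to something like the failure queue mentioned above instead of being dropped.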

Reindexing

This is not necessary for a normal database. But it is crucial for a search server, e.g. to change an analyzer or the number of shards of an index. Reindexing sounds hard but can be easily implemented in ElasticSearch, even without a separate data storage. For Jetslide I’m not storing single fields; I’m storing the entire document as JSON in the _source. This is necessary to fetch the documents from the old index and put them into the newly created one (with different settings).

But wait. How can I fetch all documents from the old index? Wouldn’t this be bad in terms of performance or memory for big indices? No, you can use the scan search type, which avoids e.g. scoring.
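The copy step itself boils down to streaming the old documents in batches and writing each batch to the new index, so that the whole index never has to sit in memory. The following sketch shows only that loop: `batches` and `reindex` are hypothetical helpers, the iterable of `(id, source)` pairs stands in for the results of a scan search, and `write_batch` stands in for whatever bulk write is used against the new index.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` items from any iterable, lazily."""
    it = iter(documents)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def reindex(old_docs, write_batch, batch_size=100):
    """Copy (id, source) pairs batch by batch; return the number copied."""
    copied = 0
    for batch in batches(old_docs, batch_size):
        write_batch(batch)  # stand-in for a bulk write to the new index
        copied += len(batch)
    return copied


if __name__ == "__main__":
    # Generator standing in for a scan over the old index.
    old_index = ((str(i), {"title": f"doc {i}"}) for i in range(250))
    new_index = {}
    total = reindex(old_index, new_index.update, batch_size=100)
    print(total, len(new_index))
```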

Ok, but how can I replace my old index with the new one? Can this be done ‘on the fly’? Yes, you can simply switch the alias of the index:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
{ "remove" : { "index" : "userindex6", "alias" : "userindex" } },
{ "add" : { "index" : "userindex7", "alias" : "userindex" } }]
}'
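When reindexing happens regularly, the request body can be generated instead of typed by hand. The sketch below builds the same kind of `actions` list as the `_aliases` request, under the assumption that one alias name is removed from the old index and added to the new one; `alias_swap_actions` is a hypothetical helper, and sending the JSON to the server is left out.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for an alias switch from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = alias_swap_actions("userindex", "userindex6", "userindex7")
    print(json.dumps(body, indent=2))
```

Clients that only ever talk to the alias do not need to know which concrete index is currently behind it.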

Performance

Well, ElasticSearch is fast. But you’ll have to determine for yourself whether it is fast enough for your use case, and compare it to your existing data storage system.

Feature Rich

ElasticSearch has a lot of features, which you do not find in a normal database. E.g. faceting or the powerful percolator to name only a few.

Conclusion

In this post I explained if and how ElasticSearch can be used as a database replacement. ElasticSearch is very powerful, but e.g. the versioning feature requires a bit of manual work. So working with ElasticSearch is more comparable to the JDBC or SQL world than to the ORM one. I’m sure some ORM tools for ElasticSearch will pop up, although I prefer to avoid system complexity and will probably always use the ‘raw’ ElasticSearch.

 

From http://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/

Published at DZone with permission of its author, Peter Karussell.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)