Software developer and frequent open-source contributor. Writing mostly for .NET, but also Java and C/C++. Really likes fiddling with data, texts especially, so he frequently finds himself working on databases or search engines, usually combining both. Itamar is a DZone MVB and is not an employee of DZone and has posted 31 posts at DZone.

Re: RavenDB Document Indexing Process

10.15.2013

A couple of months ago we published an excerpt from the RavenDB in Action book I'm writing. It contained the first sections of the chapter discussing the indexing process of RavenDB, and I was really happy with the feedback we got on it. Both excerpt readers and book readers who have read the entire chapter seemed to like it, and we recently pushed out an update that incorporates their feedback.

There are two RavenDB topics that are really important to grasp fully, and that can be non-trivial to explain, especially to people with a strong SQL background: RavenDB indexes, and document-oriented modeling. These two topics are strongly related (no pun intended) - understanding one will lead you to working better with the other, and a wrong assumption about one will lead you to work incorrectly with the other.

The trigger for this post is Alex Popescu's thoughts on the published excerpt, which I just read:

Asynchronous indexing is tricky. While it looks like addressing the performance penalty on both read and write, it actually has a few drawbacks:

  1. immediate inconsistency: with asynchronous indexes, there are no consistency guarantees.
  2. impossibility of defining unique indexes. When using async indexes, it’s impossible to define unique indexes as by the time the index would be updated it would be too late to acknowledge the client that the uniqueness constraint is not satisfied.
  3. complicated crash recovery. With async indexing, the server must be able to continue the indexing process from where it was left. If this information is not persistent, crash recovery might lead to permanent data inconsistencies.

Alex is completely right about all 3, but the drawbacks he mentions are features more than actual problems.

Let me quickly address #1 and #3 before getting to the important part of the post - yes, you don't have immediate consistency, and that is by design. In a distributed environment you have to embrace the concept of Eventual Consistency if you wish to build a working and highly available system. As I point out in the chapter (and in the excerpt as well), it is actually quite natural for us as human beings to cope with Eventual Consistency; the older systems we use daily already behave this way. It is important to highlight at this point that Eventual Consistency only applies to the indexes themselves - the actual data is stored in a fully ACID, immediately consistent, separate storage engine which is a part of RavenDB as well.

Regarding crash recovery - well, obviously that part exists as well. Crashes will not affect indexing (in the worst case they will trigger complete reindexing), and RavenDB is able to detect corrupt or incomplete indexes. Otherwise it wouldn't have been much of a reliable database, would it?
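To make the "continue from where it left off, or reindex completely" idea concrete, here is a minimal toy sketch in Python. It is not how RavenDB implements recovery, and every name in it (load_checkpoint, save_checkpoint, index_from, the JSON checkpoint file) is made up for illustration. The assumption is simply that the indexer persists the sequence number of the last write it indexed, and treats a missing or unreadable checkpoint as a signal to rebuild the index from scratch.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last indexed sequence number, or 0 if it cannot be trusted."""
    try:
        with open(path) as f:
            return int(json.load(f)["last_indexed"])
    except (OSError, ValueError, KeyError, TypeError):
        return 0  # missing or corrupt checkpoint: fall back to a full reindex


def save_checkpoint(path, seq):
    # Write to a temp file and rename it, so a crash never leaves a half-written file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_indexed": seq}, f)
    os.replace(tmp, path)


def index_from(writes, path, index):
    """Apply every write newer than the checkpoint; return how many were applied."""
    start = load_checkpoint(path)
    if start == 0:
        index.clear()  # complete reindexing
    applied = 0
    for seq, (key, term) in enumerate(writes, start=1):
        if seq <= start:
            continue  # already indexed before the restart
        index.setdefault(term, set()).add(key)
        save_checkpoint(path, seq)
        applied += 1
    return applied


with tempfile.TemporaryDirectory() as folder:
    checkpoint = os.path.join(folder, "checkpoint.json")
    writes = [("users/1", "haifa"), ("users/2", "paris")]
    index = {}  # stands in for the persisted index
    print(index_from(writes, checkpoint, index))  # 2: first run indexes everything
    writes.append(("users/3", "haifa"))
    print(index_from(writes, checkpoint, index))  # 1: resumes after the checkpoint
```

The point of the sketch is only that recovery needs one small piece of durable state, and that losing it costs time (a full reindex) rather than correctness, because the documents themselves are the source of truth.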

In his 2nd point Alex is concerned about something that simply does not exist in RavenDB - unique constraints through indexes. This is indeed a concept widely used in the world of Relational Databases, but you have to remember that indexes there are updated as part of the actual write transaction. A database record is updated together with the indexes that the update affects. This is exactly what RavenDB tries to avoid - costly, long writes that may end badly.

With RavenDB, data is data and is updated directly in the data store itself, or Document Store as I call it in the chapter. Once RavenDB can guarantee the data is safely stored, the transaction returns and lets the client continue working. For most applications, by the time the user needs to retrieve this data back via a query, the asynchronous indexing process has completed its work. If not, there are ways to know this and force waits - and those are discussed in the chapter as well. Either way, the data is always available via a Load operation, IMMEDIATELY after the write transaction has completed.
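The difference between loading by key and querying an index can be shown with a small toy model in Python. This is not RavenDB's API: ToyStore, run_indexing, query_city and the wait_for_non_stale flag are hypothetical names for this sketch, and the "background" indexing is run by hand so the stale window is easy to see.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy model only; this is not RavenDB's API."""
    docs: dict = field(default_factory=dict)     # the document store: key -> document
    pending: list = field(default_factory=list)  # keys written but not yet indexed
    by_city: dict = field(default_factory=dict)  # the index: city -> set of keys

    def put(self, key, doc):
        # The write completes once the document is stored; indexing is deferred.
        self.docs[key] = doc
        self.pending.append(key)

    def load(self, key):
        # Load by key reads the document store directly, so it never lags.
        return self.docs[key]

    def run_indexing(self):
        # Stands in for the asynchronous indexing task.
        while self.pending:
            key = self.pending.pop(0)
            for keys in self.by_city.values():
                keys.discard(key)  # drop the entry left by an older version
            self.by_city.setdefault(self.docs[key]["city"], set()).add(key)

    def query_city(self, city, wait_for_non_stale=False):
        if wait_for_non_stale:
            self.run_indexing()  # toy equivalent of waiting for the index to catch up
        is_stale = bool(self.pending)
        return sorted(self.by_city.get(city, set())), is_stale


store = ToyStore()
store.put("users/1", {"name": "Ann", "city": "Haifa"})
print(store.load("users/1")["name"])                       # Ann
print(store.query_city("Haifa"))                           # ([], True)
print(store.query_city("Haifa", wait_for_non_stale=True))  # (['users/1'], False)
```

The query result carries a staleness flag, so the caller can decide whether a possibly out-of-date answer is good enough or whether it is worth waiting.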

To achieve unique-constraints functionality with RavenDB, you can and should use the Document Store, which as I mentioned is completely ACID. There is a very good example of this in action here, which demonstrates guaranteeing uniqueness of user emails in a system. The write transaction will fail because it will try to overwrite an existing document, while the session was configured to disallow overwrites.
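The general shape of that approach can be sketched with a toy in-memory store in Python. Again, this is not RavenDB's API: ToyDocumentStore, ConflictError, the allow_overwrite flag and the "emails/..." key scheme are all assumptions made for this illustration. The idea is to reserve a document whose key is derived from the value that must be unique, and to let the store reject the write if that key already exists.

```python
class ConflictError(Exception):
    """Raised when a write would overwrite an existing document."""


class ToyDocumentStore:
    """Toy model only; this is not RavenDB's API."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc, allow_overwrite=True):
        # The existence check and the write happen together, inside the store,
        # so no index is involved in the decision.
        if not allow_overwrite and key in self._docs:
            raise ConflictError(key)
        self._docs[key] = doc

    def load(self, key):
        return self._docs.get(key)


def register_user(store, user_id, email):
    # Hypothetical key scheme: one reservation document per normalized email.
    reservation_key = "emails/" + email.strip().lower()
    # Reserve first: if the email is taken, this raises and nothing else is written.
    store.put(reservation_key, {"user": user_id}, allow_overwrite=False)
    store.put(user_id, {"email": email})


store = ToyDocumentStore()
register_user(store, "users/1", "ann@example.com")
try:
    register_user(store, "users/2", "Ann@Example.com")
except ConflictError as err:
    print("email already taken:", err)  # email already taken: emails/ann@example.com
```

In a real database both documents would be written in one transaction; the toy writes the reservation first so that a rejected registration leaves no user document behind.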

Indexes in RavenDB are about finding data, not about maintaining business logic. Such logic should exist solely in code, within the business logic in the Model or in the code actually making changes. Indexes should be used neither for unique constraints nor for enforcing referential integrity. Having business logic defined outside the code - be it in an index definition or a stored procedure - is really a SQL concept that I always hated.

By correctly applying document-oriented modelling you can avoid most such issues, and by correctly building indexes you can greatly help your modelling efforts. Learn the two as one - my $0.02.



Published at DZone with permission of Itamar Syn-hershko, author and DZone MVB. (source)