Mitch Pronschinske is the Lead Research Analyst at DZone. Researching and compiling content for DZone's research guides is his primary job. He likes to make his own ringtones, watches cartoons/anime, enjoys card and board games, and plays the accordion. Mitch is a DZone Zone Leader and has posted 2576 posts at DZone.

For Every Evernote—Its Own Lucene Index!

08.29.2011
Today I found an extremely interesting blog post on the high-tech architecture, built on open source components, behind one of the most popular online note-taking tools around: Evernote.

One of the clever architectural techniques they use to make their notes so convenient by being instantly shared and organized is by creating a shard for every single note containing 3 defacto open source technologies: MySQL, Tomcat, and Lucene.



The graphic shows you how each note gets a shard containing three different storage systems for Metadata, Resources, and Searchable Text.

"All of the metadata about each note goes into structured tables in MySQL. And by “metadata”, I mean all of the fields in the data model structures for a Note and its Resources, except for the Resource’s raw data body and any recognition/alternate data files.
Those Resource files are de-duplicated in software on each shard (using MD5+length) and then stored on a relatively simple hierarchical file system using a folder tree derived from the MD5 checksum.

The combination of MySQL and the file system allows us to store the full contents of the data model and support the vast majority of our API calls. Text-based searches on our servers require some sort of Full-Text Search (FTS) engine to provide any sort of usable performance across large data sets."  --Dave Engberg, Evernote
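The de-duplication scheme Engberg describes (MD5 plus length as the key, with a folder tree derived from the checksum) can be sketched roughly as follows. This is only an illustration, not Evernote's code: the function names, the two-level fan-out using the first four hex characters, and the "digest-length" file name are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def resource_key(data: bytes) -> str:
    """De-duplication key: MD5 hex digest plus the byte length."""
    return f"{hashlib.md5(data).hexdigest()}-{len(data)}"


def resource_path(root: Path, data: bytes) -> Path:
    """Folder tree derived from the checksum: root/<2 hex>/<2 hex>/<key>.

    The two-level layout is an assumption, not taken from the post.
    """
    key = resource_key(data)
    return root / key[:2] / key[2:4] / key


def store_resource(root: Path, data: bytes) -> Path:
    """Write the body once; identical bodies map to the same file."""
    path = resource_path(root, data)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

Storing the same attachment twice returns the same path and writes only one file, which is the point of keying on the checksum rather than on the note.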

Evernote initially used MyISAM's FTS engine within MySQL itself to index the searchable text metadata in notes.  They tried a few things with MyISAM including batch updates, but they eventually gave up and switched to Apache Lucene - a proven search library.

Why did they make the change?  Evernote had high standards: "When users create or update notes, they expect those notes to immediately match any text searches," said Dave Engberg, the author of the post.  Only Lucene could give them the virtually synchronous text indexing for each individual note after its creation.
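To make "immediately match any text searches" concrete, here is a toy sketch of synchronous indexing: the index is updated inside the save call, so a search issued right afterwards already sees the note. It is a plain in-memory inverted index standing in for Lucene, not Lucene's API, and the class and method names are hypothetical.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Toy in-memory inverted index (a stand-in for Lucene, not its API)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of notes containing it
        self.terms_by_note = {}           # note id -> terms currently indexed

    def save_note(self, note_id, text):
        # Remove the note's old postings, then index the new text
        # before returning, so the update is visible to the next search.
        for term in self.terms_by_note.get(note_id, ()):
            self.postings[term].discard(note_id)
        terms = set(re.findall(r"\w+", text.lower()))
        for term in terms:
            self.postings[term].add(note_id)
        self.terms_by_note[note_id] = terms

    def search(self, query):
        # Return ids of notes that contain every term in the query.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A batch-updated index, by contrast, would apply the work in `save_note` later, which is the delay Evernote wanted to avoid.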

When you use Evernote, every single note now has its own Lucene search index occupying a separate directory on the file system.

It wasn't so simple, however, to maintain the level of performance that they wanted, so some Lucene and MySQL (and even hardware) tuning was required.  Go ahead and read the post via the link in the References section below if you're interested in all the gory details of how they made Lucene work well for them.

Before you do, let's hear some thoughts from the search gurus out there (or just anyone, really :) ).  Do you think Evernote's got the right idea?  Lucene is currently making twice as many IO operations as MySQL, but they expect they can bring that down with some eventual tuning.

Do you think it would be worth the uncertainty and effort to try putting newer, less-proven technologies into the solution like NoSQL stores or ElasticSearch?

References
Reference: http://www.dzone.com/links/r/lucene_we_got_some_explaining_to_do_evernotes_coo.html

Comments

Nick Read replied on Mon, 2011/08/29 - 11:40pm

"When you use Evernote, every single note now has its own Lucene search index occupying a separate directory on the file system."

Don't you mean that every user has their own index, not every note?

"...instantly shared and organized is by creating a shard for every single note containing 3 defacto open source technologies: MySQL, Tomcat, and Lucene."

The article states that the data for each note is spread across 3 different storage mechanisms on each shard - there is not a single shard per note.

Justin Forder replied on Thu, 2011/09/01 - 2:22am in response to: Nick Read

The referenced Evernote blog post has a link to an earlier post giving a high-level view of their architecture, which explains the sharding:
"The core of the Evernote service is a farm of servers that we call “shards.” Each shard handles all data and all traffic (web and API) for a cohort of 100,000 registered Evernote users. Since we have more than 9 million users, this translates into around 90 shards."
Thanks for posting this - I use Evernote, and I found the details of their architecture very interesting.
