Debasish specializes in leading delivery of enterprise scale solutions for various clients ranging from small ones to Fortune 500 companies. He is the technology evangelist of Anshin Software (http://www.anshinsoft.com) and takes pride in institutionalizing best practices in software design and programming. He loves to program in Java, Ruby, Erlang and Scala and has been trying desperately to get out of the unmanaged world of C++. Debasish is a DZone MVB and is not an employee of DZone and has posted 55 posts at DZone.
NOSQL Movement - Excited with the coexistence of Divergent Thoughts
Today we are witnessing a great deal of excitement around the NoSQL
movement. Call it NoSQL (~SQL) or NOSQL (Not
Only SQL), the movement has a mission. Not all applications need to
store and process data the same way, and the storage should be
architected accordingly. Till today we have always been force-fitting a
single hammer to drive every nail. Irrespective of how we process data
in our application, we have traditionally stored it as rows and columns
in a relational database.
When we talk about applications that need really big write
scaling, relational databases suck big time. Normalized
data, joins and ACID transactions are definite anti-patterns for write
scalability. You may think sharding will solve your problems by
splitting data into smaller chunks. But in reality, the biggest problem
with sharding is that relational databases have never been designed for
it. Sharding takes away many of the benefits that relational databases
have traditionally been built for. Sharding cannot be an afterthought:
it intrudes into the business logic of your application, and
joining data from multiple shards is definitely a non-trivial effort. As
long as you can scale up vertically by increasing the
size of your box, that's possibly the sanest way to go. But Moore ..
*cough* .. *cough* .. Even if you are able to scale up vertically, try
migrating a really large MySQL database. It will take hours, or even
days. That's one of the reasons why some
companies are moving to schemaless databases when their
applications can afford to.
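
To make the point about sharding intruding into business logic concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the shard count, the hash-based routing rule, the function names, and the use of plain dicts standing in for separate database servers.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(user_id: str) -> int:
    """Pick a shard by hashing the key; every caller must know this rule."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each shard is a plain dict standing in for a separate database server.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]

def save_order(user_id: str, order: dict) -> None:
    # The application, not the database, decides where the row lives.
    shards[shard_for(user_id)][user_id].append(order)

def orders_for(user_id: str) -> list:
    # Single-key lookup: touches exactly one shard.
    return list(shards[shard_for(user_id)].get(user_id, []))

def total_revenue() -> float:
    # Cross-shard aggregate: the application fans out to every shard and
    # merges the results by hand, work a single-node SQL query did for us.
    return sum(
        order["amount"]
        for shard in shards
        for orders in shard.values()
        for order in orders
    )
```

Note how the routing rule leaks into every read and write path, and how anything that spans users has to visit all the shards.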
If we sacrifice normalization, joins and ACID transactions for the
horizontal scalability of an application,
why should we use an RDBMS? You don't need to .. Digg is moving to
Cassandra from MySQL. It all depends on your application and the kind of
write scalability that you need to achieve in processing of your data.
For read scalability, you can still manage using read-only slaves
replicating everything coming to the master database in realtime and
setting up a smart proxy router between your clients and the database.
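
The read-scaling setup described above can be sketched roughly as follows. The class names, the round-robin policy and the "starts with SELECT" test are all assumptions made for illustration, and the sketch assumes at least one replica; a real proxy would also have to worry about replication lag and transactions.

```python
import itertools

class FakeDb:
    """Stand-in for a database connection; records the statements it receives."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, sql):
        self.log.append(sql)
        return self.name  # return which server handled the statement

class ReadWriteRouter:
    """Send reads to replicas in round-robin order and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql):
        is_read = sql.lstrip().lower().startswith("select")
        target = next(self._replicas) if is_read else self.master
        return target.execute(sql)
```

Clients talk only to the router, so adding another read-only slave is a change in one place.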
The biggest excitement that the NOSQL movement has created today comes
from the divergence of thoughts that each of the products is promising.
This is very much unlike the RDBMS movement which started as a single
hammer named SQL that's capable of munging rows and columns of data
based on the theory of mathematical set operations. And every
application adopted the same storage architecture irrespective of how
it processes its data. One thing led to
another, people thought they could solve this problem with yet another
level of indirection .. and the strange thingy called an Object
Relational Mapper was born.
In the end it took the momentum of
Web shaped data processing to make us realize that not all data is
processed alike. The storage that works so well for your desktop
trading application will fail miserably in a social application where
you need to process linked data, more in the shape of a graph. The NOSQL
community has responded with Neo4J, a
graph database that offers easy storage and traversal of graph structures.
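
To see why linked data wants a graph shape, consider a "who is within two hops of me" query. The sketch below is plain Python over an adjacency list, not Neo4J's API; the function name and the sample data are made up for illustration.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return {node: hops} for every node reachable from start in <= max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # the start node is not its own neighbor
    return seen

# Hypothetical "follows" relationships in a social application.
follows = {
    "ann": ["bob", "cid"],
    "bob": ["dee"],
    "cid": ["dee"],
    "dee": ["eve"],
}
```

In a relational schema each extra hop typically costs another self-join on the relationship table, while here it is just one more step along the edges.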
If you want to go big on write scalability, the only
way out is decentralization and eventual consistency. The CAP
theorem kicks in, and you need to compromise on at least one of
consistency, availability and network partition tolerance. Riak and Cassandra offer
decentralized data stores that can potentially scale indefinitely. If
your application needs more structure than a key-value database, you can
go for Cassandra, the distributed, peer-to-peer, column oriented data
store. Have a look at the nice article
from Digg which compares their use case between a relational storage
and the columnar storage that Cassandra offers. For a document oriented
database with all the goodness of REST and JSON, Riak is the option to
choose. Also Riak offers linked map/reduce with the option to store
linked data items, much in the way the Web works. Riak is truly a Web shaped data store.
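
As a rough illustration of what eventual consistency means in a decentralized store, here is a toy model with two replicas that accept writes independently and converge later. The last-write-wins rule, the caller-supplied timestamps and the class itself are assumptions of this sketch; it is not a description of how Riak or Cassandra resolve conflicts internally.

```python
class Replica:
    """Toy replica: accepts writes locally and converges when merged with a peer."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Last write wins: keep the value carrying the newest timestamp.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        # Anti-entropy step: pull every entry from the peer and re-apply it.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)
```

Between a write and the next merge the two replicas may answer the same read differently. That window is the consistency you give up in exchange for always being able to accept the write.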
CouchDB has yet another very
interesting value proposition in this whole ecosystem of NOSQL
databases. Most of the applications are inherently offline and need
seamless and painless replication facilities. CouchDB's B-Tree based
storage structure, append
only operations with MVCC based model of concurrency control,
lockless operations, REST APIs and incremental map/reduce operations
put it in a sweet spot in the space of local browser
storage. Chris Anderson, one of the core developers of CouchDB, sums
up the value of CouchDB in today's Web based world very nicely ..
"[...] are the product of an HTML5 browser and a CouchDB instance. Their key
advantage is portability, based on the ubiquity of the html5 platform.
Features like Web Workers and cross-domain XHR really make a huge
difference in the fabric of the web. Their availability on every
platform is key to the future of the web."
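
The append-only, MVCC style of update mentioned above can be sketched in a few lines. This is only a toy model: the integer revisions, the class and the method names are assumptions, and CouchDB's actual revision identifiers and on-disk B-Tree are considerably more involved.

```python
class AppendOnlyStore:
    """Toy append-only store with optimistic, revision-checked updates."""

    def __init__(self):
        self._log = []     # every revision ever written; nothing is overwritten
        self._latest = {}  # doc id -> index of the newest revision in the log

    def get(self, doc_id):
        return self._log[self._latest[doc_id]]

    def put(self, doc_id, body, expected_rev=None):
        current = self._latest.get(doc_id)
        current_rev = None if current is None else self._log[current]["rev"]
        # No locks: a writer holding a stale revision is simply rejected.
        if current_rev != expected_rev:
            raise ValueError("conflict: document was updated by someone else")
        new_rev = 1 if current_rev is None else current_rev + 1
        self._log.append({"id": doc_id, "rev": new_rev, "body": body})
        self._latest[doc_id] = len(self._log) - 1
        return new_rev
```

Readers never block writers here, because an update appends a new revision instead of modifying the old one in place.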
MongoDB, like CouchDB, is also a document
store. It doesn't offer REST out of the box, but it's based on JSON
storage. It has map/reduce as well, but also offers a strong suite of
query APIs much like SQL. This is the main sweet spot of MongoDB, which
plays very well to people coming from a SQL background. MongoDB also
offers master-slave replication and has been working towards
autosharding based scalability and failover support.
There are quite a few other data stores that offer solutions to problems that you
face in everyday application design: caching, worker queues requiring
atomic push/pop operations, processing activity streams, logging data
etc. Redis and Tokyo Cabinet are nice fits for such use
cases. You can think of Redis as a memcached with a backup persistent
key-value database. It's single threaded, uses non-blocking IO and is
blazing fast. Redis, besides offering everyday key/value storage, also
offers lists and sets to be stored along with atomic operations on each of
them. Pick the one that fits your bill the best.
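
Here is a small sketch of why atomic push/pop on a list makes a good worker queue. It is an in-memory stand-in written with the Python standard library, not a Redis client; the method names merely echo the Redis LPUSH and RPOP commands, and the class and helper are hypothetical.

```python
import threading
from collections import deque

class TinyListStore:
    """In-memory stand-in for a key -> list store with atomic push/pop."""

    def __init__(self):
        self._lists = {}
        self._lock = threading.Lock()

    def lpush(self, key, value):
        with self._lock:
            self._lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        with self._lock:
            items = self._lists.get(key)
            return items.pop() if items else None

def run_workers(store, key, n_workers):
    """Drain the queue with several threads and return the jobs they handled."""
    done = []
    done_lock = threading.Lock()

    def worker():
        # Each pop is atomic, so no two workers can receive the same job.
        while (job := store.rpop(key)) is not None:
            with done_lock:
                done.append(job)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Because the pop either hands a job to exactly one worker or returns nothing, the workers need no coordination beyond the store itself.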
Another interesting aspect is the interoperability between these data stores.
Riak, for example, offers pluggable data backends - possibly we can have
CouchDB as the data backend for Riak (can we?). Possibly we will also see a
Cassandra backend for Neo4J. It's extremely heartening to see that each
of these communities has a deep sense of cooperation in making the
entire ecosystem more meaningful and thriving.