
Mitch Pronschinske is the Lead Research Analyst at DZone. Researching and compiling content for DZone's research guides is his primary job. He likes to make his own ringtones, watches cartoons/anime, enjoys card and board games, and plays the accordion. Mitch is a DZone Zone Leader and has posted 2576 posts at DZone.

Cassandra NoSQL Database an Apache Top Level Project

After Facebook made the Cassandra project open source in 2008, the highly scalable, non-relational distributed database proved its mettle at other companies including Cisco, Digg, and Rackspace. Now the well-known NoSQL database has proven itself as an active Apache project with enough training to graduate from the incubator. The board recently approved a resolution to adopt Cassandra as a Top Level Project. The new "Apache Cassandra" should serve as another example of a high-profile, non-relational data solution and success story.

Cassandra was born out of Facebook's need to store reverse indices of Facebook messages that users send and receive while communicating with their friends.  The solution needed to scale incrementally while remaining cost effective.  Traditional data storage was not an option, so Facebook created a non-relational solution called "Cassandra".  The project was designed by Avinash Lakshman and Prashant Malik.  Lakshman was one of the authors of Amazon's Dynamo, another large-scale NoSQL database.  In many ways, Cassandra is like the second version of Dynamo, or a marriage of Dynamo and Google's BigTable.  Lakshman further describes Cassandra's data model and the distributed properties provided by the system:

Data Model

  • Every row is identified by a unique key. The key is a string and there is no limit on its size.
  • An instance of Cassandra has one table which is made up of one or more column families as defined by the user.
  • The number of column families and the name of each of the above must be fixed at the time the cluster is started. There is no limit on the number of column families, but it is expected that there would be only a few of these.
  • Each column family can contain one of two structures: supercolumns or columns. Both of these are dynamically created and there is no limit on the number of these that can be stored in a column family.
  • Columns are constructs that have a name, a value and a user-defined timestamp associated with them. The number of columns that can be contained in a column family is very large. Columns could be of variable number per key. For instance key K1 could have 1024 columns/super columns while key K2 could have 64 columns/super columns.
  • “Supercolumns” are constructs that have a name and an infinite number of columns associated with them. The number of “Supercolumns” associated with any column family could be infinite and of a variable number per key. They exhibit the same characteristics as columns.
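
The data model above can be pictured as nested maps. The following is a minimal Python sketch of that picture, not Cassandra's actual storage code or client API: the `Column` class, the `insert` helper, and the column family names `profiles` and `inbox` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # user-defined, as the data model describes


def insert(column_family, row_key, column, supercolumn=None):
    """Store a column under a row key, optionally inside a named supercolumn."""
    row = column_family.setdefault(row_key, {})
    target = row if supercolumn is None else row.setdefault(supercolumn, {})
    target[column.name] = column


# Hypothetical column families, for illustration only.
profiles = {}  # a column family that holds plain columns
inbox = {}     # a column family that holds supercolumns

# Rows may carry a different number of columns per key.
insert(profiles, "K1", Column("email", "k1@example.com", 1))
insert(profiles, "K1", Column("city", "Madison", 2))
insert(profiles, "K2", Column("email", "k2@example.com", 3))

# A supercolumn groups any number of columns under one name.
insert(inbox, "K1", Column("msg-1", "hello", 10), supercolumn="friend-a")
insert(inbox, "K1", Column("msg-2", "lunch?", 11), supercolumn="friend-a")
```

Here `profiles["K1"]` ends up with two columns and `profiles["K2"]` with one, while `inbox["K1"]["friend-a"]` is a supercolumn containing two columns.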

Distribution, Replication and Fault Tolerance

  • Data is distributed across the nodes in the cluster using Consistent Hashing based on an Order Preserving Hash function. We use an Order Preserving Hash so that we could perform range scans over the data for analysis at some later point.
  • Cluster membership is maintained via Gossip style membership algorithm. Failures of nodes within the cluster are monitored using an Accrual Style Failure Detector.
  • High availability is achieved using replication, and we actively replicate data across data centers. Since eventual consistency is the mantra of the system, reads execute on the closest replica and data is repaired in the background for increased read throughput.
  • System exhibits incremental scalability properties which can be achieved as easily as dropping nodes and having them automatically bootstrapped with data.
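
To make the first bullet concrete, here is a small Python sketch of a consistent-hashing ring with an order-preserving token. It is an illustration under stated assumptions, not Cassandra's partitioner: the `Ring` class, the node names, the token values, and the choice of placing replicas on the next nodes around the ring are all hypothetical. The key itself serves as its token, so keys that sort together land on the same or neighboring nodes, which is what makes range scans practical.

```python
import bisect


class Ring:
    """Toy ring: each node owns the keys up to and including its token."""

    def __init__(self, node_tokens, replication_factor=2):
        # node_tokens maps a node name to its token (a plain string here).
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]
        self._rf = min(replication_factor, len(self._ring))

    def replicas(self, key):
        # Order preserving: the key is used as its own token, unhashed.
        start = bisect.bisect_left(self._tokens, key) % len(self._ring)
        # Assumption: replicas are the next nodes clockwise on the ring.
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self._rf)]


ring = Ring({"node-a": "g", "node-b": "n", "node-c": "t"})
owners = {key: ring.replicas(key) for key in ["apple", "kiwi", "pear", "zebra"]}
```

With these tokens, "apple" maps to node-a and node-b, "kiwi" to node-b and node-c, "pear" to node-c and node-a, and "zebra" wraps around the ring to node-a and node-b. Adding a node only changes ownership for the keys between the new token and its predecessor, which is the incremental scalability property the last bullet refers to.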

Think of Cassandra as a large 4- or 5-level associative array. Each dimension of the array has a free index that is based on the keys in that level. The optional 5th level is the Supercolumn, which is where the real power comes from. It can allow a simple key-value architecture to deal with sorted lists based on a specified index. Cassandra has no single points of failure and it is able to scale from one node to several thousand in different data centers. There is no central master, so data can be written to any node in the cluster and read from any other node. Cassandra can be tuned to support more consistency or availability depending on your application. There's also a high availability guarantee where if one node goes down, another one will step in and replace it smoothly.
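
The consistency-versus-availability tuning can be illustrated with the quorum reasoning common to Dynamo-style systems. The article does not spell out this rule, so treat the sketch below as an assumption-laden illustration rather than a description of Cassandra's settings: `n` is the number of replicas, `w` the number that must acknowledge a write, and `r` the number consulted on a read. The function name is hypothetical. It checks, by brute force, whether every possible read set shares at least one replica with every possible write set.

```python
from itertools import combinations


def read_always_sees_latest_write(n, r, w):
    """Return True if any r replicas must overlap any w replicas out of n."""
    replicas = range(n)
    for written in combinations(replicas, w):
        for read in combinations(replicas, r):
            if not set(written) & set(read):
                # Some read could miss the replicas that took the write.
                return False
    return True


quorum = read_always_sees_latest_write(n=3, r=2, w=2)      # overlap guaranteed
fast = read_always_sees_latest_write(n=3, r=1, w=1)        # overlap not guaranteed
write_all = read_always_sees_latest_write(n=3, r=1, w=3)   # overlap guaranteed
```

The overlap is guaranteed exactly when r + w > n. Smaller values of r and w keep the system available when more nodes are down, at the cost of reads that may return stale data until background repair catches up.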

Insert Speed benchmark on a quad-core system with 2GB RAM  (Cassandra 0.4 vs. 0.5)
- Source

The most recent version of Cassandra has improved concurrency across the board (see above). Released last month, Cassandra 0.5 also adds load balancing and significantly improves bootstrap. 0.5 also features new tools, including JSON-based data import and export, new JMX metrics, and an improved command line interface. Cassandra is one of three Apache projects that have graduated from the incubator this year. The others are Apache Pivot and Apache Subversion.


Jonathan Ellis replied on Mon, 2010/02/22 - 5:36pm

If you're going to borrow charts from my blog, I'd appreciate a link.

You should also credit, from which paragraph 3 is lifted nearly verbatim.

Chris Goffinet replied on Mon, 2010/02/22 - 5:49pm

That's a nice graph, I should make one like it.


P.S. your site registration is the worst of any site I have ever experienced. I would rather pay you $1 to comment than go through that hell again.

Mitch Pronschinske replied on Tue, 2010/02/23 - 5:29pm in response to: Jonathan Ellis

Sorry about the chart.  The proper citation has been added.  The post was very informative.

Yaniv Nacfrtge replied on Thu, 2011/08/18 - 10:24am

New RPM build for the apache cassandra project hosted at google code.

Kookee Gacho replied on Thu, 2012/06/14 - 5:37am

NoSQL database systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run time flexibility compared to full SQL systems is compensated by significant gains in scalability and performance for certain data models.-Madison Pharmacy Associates

Carla Brian replied on Wed, 2012/07/04 - 6:18pm

Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. - Mercy Ministries
