
I've been a zone leader with DZone since 2008. I work as a technical lead on a next-generation tool suite based on Eclipse. This means that I get to use Java every day, along with a selection of the best Eclipse technologies, from EMF to GEF to Xtext. I also do iOS development in my spare time. James is a DZone Zone Leader and has posted 509 posts at DZone.

An Introduction To Cassandra: The Data Model

September 14, 2010 AT 12:51 AM

I'm fairly new to the whole NoSQL game, and one thing I keep hearing is how great Cassandra is. Built by Facebook and open sourced in 2008, Cassandra is probably the most popular NoSQL implementation, described as "a massively scalable, decentralized, structured data store". Cassandra takes its distribution features from Dynamo and its data model from BigTable.

Before we look at using Cassandra, we first need to understand the data model. For developers who are new to Cassandra and come from a relational database background, the data model can be a bit confusing. Here's a summary of how the Cassandra data model is composed:

Column

A Column is the most basic element in Cassandra: a simple tuple that contains a name, a value and a timestamp. All three are set by the client. That's an important consideration for the timestamp, as it decides which write is the most recent, so you'll need clock synchronization between clients.
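To make the tuple concrete, here is a minimal Java sketch of a column. The class and method names are hypothetical and are not part of any Cassandra client library. The `reconcile` method assumes that the version with the newest timestamp wins when two versions of the same column meet, which is why unsynchronized client clocks matter.

```java
import java.nio.charset.StandardCharsets;

// Illustrative model only; this is not a class from a Cassandra client API.
final class Column {
    final String name;
    final byte[] value;
    final long timestamp; // supplied by the client, not by the server

    Column(String name, byte[] value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    // Convenience factory for string values, encoded as UTF-8 bytes.
    static Column of(String name, String value, long timestamp) {
        return new Column(name, value.getBytes(StandardCharsets.UTF_8), timestamp);
    }

    String valueAsString() {
        return new String(value, StandardCharsets.UTF_8);
    }

    // Assumed rule: the newer timestamp wins. On a tie this sketch simply keeps 'a'.
    static Column reconcile(Column a, Column b) {
        return b.timestamp > a.timestamp ? b : a;
    }
}
```

If one client's clock runs behind, its later write can carry an older timestamp and lose to a value that was actually written earlier.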



SuperColumn

A SuperColumn is a column that stores an associative array of columns. You could think of it as similar to a HashMap in Java, with an identifying name and a value that holds the columns inside it. The key difference between a Column and a SuperColumn is that the value of a Column is a single value (a byte array, often a string), whereas the value of a SuperColumn is a map of Columns. Note that SuperColumns have no timestamp, just a name and a value.
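The HashMap analogy can be sketched in Java as follows. This is an illustration under assumptions rather than Cassandra's own API: `SuperColumn` and `SubColumn` are hypothetical names, values are simplified to strings, and a `TreeMap` is used so the contained columns stay sorted by name.

```java
import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative model only; not a class from a Cassandra client API.
final class SuperColumn {
    // Compact stand-in for a regular column: name, value and client timestamp.
    static final class SubColumn {
        final String name;
        final String value;
        final long timestamp;

        SubColumn(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    final String name; // note: no timestamp at this level
    private final SortedMap<String, SubColumn> columns = new TreeMap<>();

    SuperColumn(String name) {
        this.name = name;
    }

    // Adds or overwrites a contained column; returns this to allow chaining.
    SuperColumn put(String columnName, String value, long timestamp) {
        columns.put(columnName, new SubColumn(columnName, value, timestamp));
        return this;
    }

    SubColumn get(String columnName) {
        return columns.get(columnName);
    }

    // Read-only view of the contained columns, sorted by column name.
    SortedMap<String, SubColumn> columns() {
        return Collections.unmodifiableSortedMap(columns);
    }
}
```

For example, `new SuperColumn("address").put("street", "1 Main St", 1L).put("city", "Dublin", 1L)` groups two columns under the single name `address`.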



ColumnFamily

A ColumnFamily holds a number of rows, and is similar to the table concept from relational databases. Each row is identified by a key and contains a sorted map that matches column names to column values, so you can reference any column in the row by its name.

The ColumnFamily can be of two types, Standard or Super. Standard ColumnFamilies contain a map of normal columns, while Super ColumnFamilies contain rows of SuperColumns.
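One way to picture the difference between the two types is as nested maps. The sketch below is an assumption-laden illustration, not Cassandra's API: the class names are hypothetical, timestamps are omitted, and values are plain strings so that only the shape of the data is visible. A Super ColumnFamily simply has one extra level of nesting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: row key -> (column name -> column value).
final class StandardColumnFamily {
    final String name;
    final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    StandardColumnFamily(String name) {
        this.name = name;
    }

    void insert(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }
}

// Illustrative only: row key -> (super column name -> (column name -> column value)).
final class SuperColumnFamily {
    final String name;
    final Map<String, SortedMap<String, SortedMap<String, String>>> rows = new HashMap<>();

    SuperColumnFamily(String name) {
        this.name = name;
    }

    void insert(String rowKey, String superColumn, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(superColumn, k -> new TreeMap<>())
            .put(column, value);
    }

    String get(String rowKey, String superColumn, String column) {
        SortedMap<String, SortedMap<String, String>> row = rows.get(rowKey);
        if (row == null) return null;
        SortedMap<String, String> columns = row.get(superColumn);
        return columns == null ? null : columns.get(column);
    }
}
```

Reading a value from the standard type takes a row key and a column name; the super type needs a SuperColumn name in between.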



KeySpaces

KeySpaces are the largest container, with an ordered list of ColumnFamilies, similar to a database in an RDBMS. The KeySpace is normally named after the application.

Multiple KeySpaces reside in a cluster: the set of machines (nodes) that make up a Cassandra instance.
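Putting the levels together, a value in a standard ColumnFamily is addressed by KeySpace, ColumnFamily, row key and column name. The sketch below shows that lookup path with nested maps. `Keyspace` here is a hypothetical class for illustration, not the client API, and it creates ColumnFamilies on first use purely for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only; values are plain strings and timestamps are omitted.
final class Keyspace {
    final String name; // normally the application name

    // column family name -> row key -> (column name -> column value)
    private final Map<String, Map<String, SortedMap<String, String>>> columnFamilies =
            new LinkedHashMap<>();

    Keyspace(String name) {
        this.name = name;
    }

    void insert(String columnFamily, String rowKey, String column, String value) {
        columnFamilies.computeIfAbsent(columnFamily, k -> new LinkedHashMap<>())
                      .computeIfAbsent(rowKey, k -> new TreeMap<>())
                      .put(column, value);
    }

    // Returns null if any level of the path is missing.
    String get(String columnFamily, String rowKey, String column) {
        Map<String, SortedMap<String, String>> rows = columnFamilies.get(columnFamily);
        if (rows == null) return null;
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }
}
```

With this shape, `keyspace.get("Users", "jsugrue", "email")` walks one level per argument, which mirrors how the containers described above nest inside each other.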

 

For another summary of the Cassandra data model, check out the (nicely titled) "WTF is a SuperColumn".

In the next article in this introduction series, we'll move on to the good stuff: using Cassandra in Java.
