NoSQL Zone is brought to you in partnership with:

Ilya Sterin is a software engineer with Nextrials, a clinical trials data management software company and also consults for a variety of startups, specifically dealing with scalable and distributed system architectures. Ilya’s also a book and blog author and avid user and contributor to open source software. When not hacking on yet another software project, Ilya enjoys spending his time coaching his 10 year-old son’s ever growing amount of sports teams. Ilya is a DZone MVB and is not an employee of DZone and has posted 5 posts at DZone. View Full User Profile

NOSQL Databases for Web CRUD (CouchDB) - Shows/Views

02.11.2010
| 9358 views |
  • submit to reddit

There are many applications that easily lend themselves to CRUD paradigm. Even if only 80% of the application’s functionality is pure CRUD, one can benefit from a simpler storage model. For so many years many (including myself) thought that storing application state which needs to be used in an enterprise-grade applications, meant we had one option, an RDBMS. Not that other models weren’t available, but their prevalence was not as high which made one question the quality/stability and long term health of such software. So we’ve gotten accustomed to approaching state persistence by sticking everything into one hole. If it didn’t fit, we trimmed it, cut it, squeezed it, stumped on it, but we made it fit. Then when it was time to pull out, ah, we repeated the procedure. ORMs are one of the most popular remedies for such procedures. But if one has ever developed a complex data model and actually took the time to think about the data access strategy, on both ends, application and RDBMS, you’d quickly run into many limitations of the ORM model. I guess you can abstract away the impedance mismatch only so much, but watch out for these leaky abstractions. So if you’re still rusty on your SQL and the relational theory because the great Gods promised that you’ll never have to worry about it if you use ORM, you better get to learning, unless you’re planning maintaining a ToDo List application for the rest of your life.

So, with this out of the way, let’s talk about real data persistence. So there are many applications (especially web applications), that don’t lend themselves very well into the relational persistence model. There are many reasons for this, but those who have ever had to beat their heads against the wall to bend the relational model to persist data know what I mean. By the time you’re done, you’re using and RDBMS to store non-relational data and all the benefits of the relational model are moot. You might as well store your data in an excel spreadsheet. So what are some of these reasons?

  1. Highly dynamic structure (relational schemas are rather static, if you’re doing it the relational way (no tall/skinny tables))
  2. Data model is not very relational. That speaks for itself, but many don’t really know when and how to identify this, as we’ve spend so much time identifying relations that don’t exist or are irrelevant to the application.
  3. Your relational schema is denormalized to the point where you’re no longer benefiting from relational database features like enforcing consistency and reducing redundancy in the data.
  4. Your relational database is bending backwards to accommodate your read/write throughput even after you denormalized (which itself is a reason to look elsewhere) and optimized, forcing you to continuously have to scale up to allow for increased load.
  5. You continuously marshall/unmarshall data (ORM???) to persist it and then to transform it to another data format.

Touching a bit more on bullet #5; Lots of software is written using good OO principles of encapsulation. Encapsulation is the heart of software abstractions and is probably the most important principle. But people tend to abuse it. Abstractions are good when they add value, but marshalling and unmarshalling one data structure into another without a good apparent reason, other than you don’t have to learn how to deal with a particular data structure, I’m not sure is such a good reason. So many software projects use ORM for the sake of not learning SQL, but how far can you actually get? ORM is a perfect example of a leaky abstraction. So many projects retrieve data from a web view in JSON or url-encoded format and marshall that into objects, only to validate the data and persist it using an ORM. So now you’ve unmarshalled the data from JSON to an object graph to just then marshall it again into a SQL query to send to the database. Do we really need these superfluous steps?

I’m sure there are other reasons I haven’t mentioned here. These reasons I personally faced when making my decisions.

A rational way of deciding on data persistence is not to automatically start writing a DDL script or grab your ER diagram tool, but rather look at what data you have, how would this data persist in a “natural” way, how does the client software need to access this data, what are the performance/scalability considerations and then go out and look at different persistence models to find the best match.

In my latest project, I had to think about a way to persist hierarchical data. This data will be accessed through some web medium (browser, mobile client, etc…) majority of the time. One of the web interfaces will be an ajax enabled web app, another will be an iphone and/or adroid app. JSON is communication protocol lingua franca of the web. Some will argue it’s XML, but I’ll keep my XML opinions to myself at this point.

CouchDB is a document database which one would call key/value store. It allows for storage of JSON documents that are uniquely identified by keys. Sounds interesting, not really. There are tens of other databases that have same capabilities, so why CouchDB? Well in one short sentence: CouchDB is build one the web and for the web. So what does that really mean? Well, besides the JSON storage structure and its innate ability to scale horizontally, they’ve build some pretty awesome features that makes it very appealing for a particular type of an application. The task is to decide whether the application you’re building is that application. So not to make this post any longer than it already is with my rant, let me describe and demonstrate some of the features that I’ve used over the last few days and why they are relevant.

Please make sure you have couchdb 0.10.* version installed as well as curl command line utility. For installation instructions see http://wiki.apache.org/couchdb/Installation

Once couchdb is installed, you can start it using the couchdb command. Depending on your setup, you can do…

sudo couchdb

A little bit of a background though before we get any further…

We’re going to store hierarchical data, which JSON is a natural fit for. One of the issues we have, is that in our industry, there are numerous data standards and they are all defined either in XML or in some delimited rectangular format. One major use case involves performing CRUD operations on the data from variety of sources (web app, mobile app, etc…) as well as being able to emit this data in one of the industry standard formats for integration purposes.

CouchDB exposes a RESTful API, so it’s rather easy to use it from any language which supports HTTP. Most popular languages have abstraction libraries on top of that, to abstract away the HTTP abstraction. Here is a list of available clients: http://wiki.apache.org/couchdb/Basics. For our purposes we’re going to use curl, a command line utility which allows us to make HTTP requests. So let’s see how we can easily accomplish this with CouchDB.

Now that CouchDB is successfully running, let’s create a database and insert some sample data…

curl -X PUT “http://localhost:5984/sample_db”

Above line create a database called sampledb. If the command is successfule, you will see the following output: _{“ok”:true}. Now lets add three files to this database. The JSON data files which we’re sending below are found in code snippets below labeled accordingly, so make sure they are in the directory from which you’re running the below commands.

curl -X PUT -d @rec1.json “http://localhost:5984/sample_db/record1” curl -X PUT -d @rec2.json “http://localhost:5984/sample_db/record2” curl -X PUT -d @rec3.json “http://localhost:5984/sample_db/record3”

Again, each command should yield a JSON response with “ok” set to true if the add succeeded. Here is what one would expect from the first command: {“ok”:true,”id”:”record1”,”rev”:”1-7c15e9df17499c994439b5e3ab1951d2”}. Again, ok is set to true making this a success response. The id field is set to the name of the record which we created. You can see that names are set through the URL as they are just resources in the world of REST. The rev field displays the revision of this document. CouchDB’s concurrency model is based on MVCC, though it versions the documents as it updates them, so each document modification gets it’s unique revision id. You can read more about this in CouchDB’s architecture and API documentation.

rec1.json

 {
"name": "John Doe",
"date": "2001-01-03T15:14:00-06:00",
"children": [
{"name": "Brian Doe", "age": 8, "gender": "Male"},
{"name": "Katie Doe", "age": 15, "gender": "Female"}
]
}

 

rec2.json

{
"name": "Ilya Sterin",
"date": "2001-01-03T15:14:00-06:00",
"children": [
{"name": "Elijah Sterin", "age": 10, "gender": "Male"}
]
}

 

rec3.json

{
"name": "Emily Smith",
"date": "2001-01-03T15:14:00-06:00",
"children": [
{"name": "Mason Smith", "age": 3, "gender": "Male"},
{"name": "Donald Smith", "age": 2, "gender": "Male"}
]
}

Now that we have the data persisted, let’s talk about some strategies for getting the data out.

CouchDB supports views. They are used to query and report on the data stored in the database. Views can be permanent, meaning they are stored in CouchDB as named queries and are accessed through their name. Views can also be temporary, meaning they are executed and discarded. CouchDB computes and stores view indexes, so view operations are very efficient and can theoretically (and I believe practically) span across remote nodes. Views are written as map/reduce operations, though they land themselves well for distribution. Here is an example of a map function in a view. (Reduce functions are optional if your query requires aggregating result sets)

 function(doc) {
if (doc.name == "Ilya Sterin") {
emit(null, doc);
}
}

There are other two really cool features, which allow more effective data filtering and transformation. These features are shows and lists. The purpose of shows and lists is to render a JSON document in a different format. Shows allow to transform a single document into another format. A show is similar to a view function, but it takes two parameters function(doc, req), doc is the document instance being iterated and request is an abstraction over CouchDB request object. Here is a simple show function…

function(doc, req) {
var person = ;
person.@name = doc.name;
person.@joined = doc.date;
person.children = ;
if (doc.children) {
for each (var chldInst in doc.children) {
var child = ;
child.text()[0] = chldInst.name;
child.@age = chldInst.age;
child.@gender = chldInst.gender;
person.children.appendChild(child);
}
}
return {
'body': person.toXMLString(),
'headers': {
'Content-Type': 'application/xml'
}
}
}

 

The xml function and inlines you see here is the e4x which adds native support as a part of ECMAScript and is implemented in the embedded javascript engine Spidermonkey, which CouchDB uses.

This show function, takes a particular JSON record and turns it into XML. Creating a show is pretty simple, you just encapsulate the function above into a design document and create the record through PUT.

Here is the design document for the show above…

xml_show.json


{
"shows": {
"toxml": "Here you inline the show function above. Make sure all double quotes are escaped..."
}
}

Once you have the design document, create it…

curl -X PUT -H “Content-Type: application/json” -d @xml_show.json “http://localhost:5984/sample_db/_design/shows”

Note: in (…./_design/shows), shows is just a name of the design document, you can call it what ever you want

Now let’s invoke the show

curl -X GET “http://localhost:5984/sample_db/_design/shows/_show/toxml/record1”

Here is the output

<person name="John Doe" joined="2001-01-03T15:14:00-06:00">
<children>
<child age="8" gender="Male">Brian Doe</child>
<child age="15" gender="undefined">Katie Doe</child>
</children>
</person>

 

So, that was super easy, we stored our document which required no code on our behalf and then we retrieved it with minimal effort by using ECMAScript’s e4x facilities.

So how would I transform a record collection or view results into a different format? Well, this is where lists come in. Lists are similar to shows, but they are applied to the results of an already present view. Here is a sample list function.

function(head, req) {
start({'headers': {'Content-Type': 'application/xml'}});
var people = ;
var row;
while (row = getRow()) {
var doc = row.value;
var person = ;
person.@name = doc.name;
person.@joined = doc.date;
person.children = ;
if (doc.children) {
for each (var chldInst in doc.children) {
var child = ;
child.text()[0] = chldInst.name;
child.@age = chldInst.age;
child.@gender = chldInst.gender;
person.children.appendChild(child);
}
}
people.appendChild(person);
}
send(people.toXMLString());
}

 

Again, you encapsulate this list function into a design document, along with a simple view function…

xml_list.json

  {
"views": {
"all": {
"map": "function(doc) { emit(null, doc); }"
}
},
"lists": {
"toxml": "Here you inline the show function above. Make sure all double quotes are escaped as it must be stringified due to the fact that JSON can't store a function type."
}
}

 

Now, we create the design document

curl -X PUT -H “Content-Type: application/json” -d @xml_list.json “http://localhost:5984/sample_db/_design/lists”

Once the design document is created, we can request our xml document listing all person records

curl -X GET http://localhost:5984/sampledb/design/lists/_list/toxml/all

And the output is

<people>
<person name="John Doe" joined="2001-01-03T15:14:00-06:00">
<children>
<child age="8" gender="Male">Brian Doe</child>
<child age="15" gender="Female">Katie Doe</child>
</children>
</person>
<person name="Ilya Sterin" joined="2001-01-03T15:14:00-06:00">
<children>
<child age="10" gender="Male">Elijah Sterin</child>
</children>
</person>
<person name="Emily Smith" joined="2001-01-03T15:14:00-06:00">
<children>
<child age="3" gender="Male">Mason Smith</child>
<child age="2" gender="Male">Donald Smith</child>
</children>
</person>
</people>

So you can see how shows and lists are very useful and provide a convenient way to transform views into different formats.

As you can see, we created the database and stored data. No code was required to make that happen, just collect the data through your application and make a CouchDB REST request. We also added some custom functionality of transforming the data for multiple client consumption by using shows and views. In my opinion, CouchDB is a great step towards what one could call web/cloud scale database. It has awesome abilities to integrate with web technologies and it can scale to support the ever increasing web scale of data. In other words, it fits some application models like a glove.

I barely even scraped the tip of the iceberg of what CouchDB can do. We haven’t talked about result aggregates which can be achieved with map/reduce, we also haven’t discussed data validation and security. These features might be a top of some future posts.

You can find more related blog posts/information at Ilya's blog.

Published at DZone with permission of Ilya Sterin, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Tags: