NoSQL Zone is brought to you in partnership with:

I am a software architect working in service hosting area. I am interested and specialized in SaaS, Cloud computing and Parallel processing. Ricky is a DZone MVB and is not an employee of DZone and has posted 87 posts at DZone. You can read more from them at their website. View Full User Profile

CouchDB Cluster

04.06.2011
| 7849 views |
  • submit to reddit
Lets look at how one can layer a cluster on top of CouchDB.

Couch Cluster

A “Couch Cluster” is composed of multiple “partitions”. Each partition is composed of multiple replicated DB instances. We call each replica a “virtual node”, which is basically a DB instance hosted inside a "physical node", which is a CouchDB process running in a machine. “Virtual node” can migrate across machines (which we also call “physical node”) for various reasons, such as …
  • when physical node crashes
  • when more physical nodes are provisioned
  • when the workload of physical nodes are unbalanced
  • when there is a need to reduce latency by migrating closed to the client

Proxy

The "Couch Cluster" is frontend by a "Proxy", which intercept all the client's call and forward it to the corresponding "virtual node". In doing so, the proxy has a "configuration DB" which store the topology and knows how the virtual nodes are distributed across physical nodes. The configuration DB will be updated at more DBs are created or destroyed. Changes of the configuration DB will be replicated among the proxies so each of them will eventually share the same picture of the cluster topology.


In this diagram, it shows a single DB, which is split into 2 partitions (the blue and orange partitions). Each partition has 3 replicas, where one of them is the primary and the other two are secondaries.

Create DB
  1. Client call Proxy with URL=http://proxy/dbname; HTTP_Command = PUT /dbname
  2. Proxy need to determine number of partitions and number of replications is needed, lets say we have 2 partitions and each partition has 3 copies. So there will be 6 virtual nodes. v1-1, v1-2, v1-3, v2-1, v2-2, v2-3.
  3. Proxy also need to determine which virtual node is the primary of its partition. Lets say v1-1, v2-1 are primary and the rest are secondaries.
  4. And then Proxy need to determine which physical node is hosting these virtual nodes. say M1 (v1-1, v2-2), M2 (v1-2, v2-3), M3 (v1-3, v2-1).
  5. Proxy record its decision to the configuration DB
  6. Proxy call M1 with URL=http://M1/dbname_p1; HTTP_Command = PUT /dbname_p1. And then call M1 again with URL=http://M1/dbname_p2; HTTP_Command = PUT /dbname_p2.
  7. Proxy repeat step 6 to M2, M3

List all DBs
  1. Client call Proxy with URL=http://proxy/_all_dbs; HTTP_Command = GET /_all_dbs
  2. Proxy lookup the configuration DB to determine all the DBs
  3. Proxy return to client

Get DB info
  1. Client call Proxy with URL=http://proxy/dbname; HTTP_Command = GET /dbname
  2. Proxy will lookup the configuration DB for all its partitions. For each partition, it locates the virtual node that host the primary copy (v1-1, v2-1). It also identifies the physical node that host these virtual nodes (M1, M3).
  3. For each physical node, say M1, the proxy call it with URL=http://M1/dbname_p1; HTTP_Command = GET /dbname
  4. Proxy do the same to M3
  5. Proxy combine the results of M1, and M3 and then forward to the client

Delete DB
  1. Client call Proxy with URL=http://proxy/dbname; HTTP_Command = DELETE /dbname
  2. Proxy lookup which machines is hosting the clustered DB and find M1, M2, M3.
  3. Proxy call M1 with URL=http://M1/dbname_p1; HTTP_Command = DELETE /dbname_p1. Then Proxy call M1 again with URL=http://M1/dbname_p2; HTTP_Command = DELETE /dbname_p2.
  4. Proxy do the same to M2, M3

Get all documents of a DB
  1. Client call Proxy with URL=http://proxy/dbname/_all_docs; HTTP_Command = GET /dbname/_all_docs
  2. Proxy will lookup the configuration DB for all its partitions. For each partition, it randomly locates the virtual node that host a copy (v1-2, v2-2). It also identifies the physical node that host these virtual nodes (M1, M2).
  3. Proxy call M1 with URL=http://M1/dbname_p1/_all_docs; HTTP_Command = GET /dbname_p1/_all_docs.
  4. Proxy do the same to M2
  5. Proxy combine the results of M1, and M3 and then forward to the client

Create / Update a document
  1. Client call Proxy with URL=http://proxy/dbname/docid; HTTP_Command = PUT /dbname/docid
  2. Proxy will invoke "select_partition(docid)" to determine the partition, and then lookup the primary copy of that partition (e.g. v1-1). It also identifies the physical node (e.g. M1) that host this virtual node.
  3. The proxy call M1 with URL=http://M1/dbname_p1/docid; HTTP_Command = PUT /dbname_p1/docid

Get a document
  1. Client call Proxy with URL=http://proxy/dbname/docid; HTTP_Command = GET /dbname/docid
  2. Proxy will invoke "select_partition(docid)" to determine the partition, and then randomly get a copy of that partition (e.g. v1-3). It also identifies the physical node (e.g. M3) that host this virtual node.
  3. The proxy call M3 with URL=http://M3/dbname_p1/docid; HTTP_Command = GET /dbname_p1/docid

Delete a document
  1. Client call Proxy with URL=http://proxy/dbname/docid?rev=1234; HTTP_Command = DELETE /dbname/docid?rev=1234
  2. Proxy will invoke "select_partition(docid)" to determine the partition, and then lookup the primary copy of that partition (e.g. v1-1). It also identifies the physical node (e.g. M1) that host this virtual node.
  3. The proxy call M1 with URL=http://M1/dbname_p1/docid?rev=1234; HTTP_Command = DELETE /dbname_p1/docid?rev=1234

Create a View design doc
  1. Client call Proxy with URL=http://proxy/dbname/_design/viewid; HTTP_Command = PUT /dbname/_design/viewid
  2. Proxy will determine all the virtual nodes of this DB, and identifies all the physical nodes (e.g. M1, M2, M3) that host these virtual nodes.
  3. The proxy call M1 with URL=http://M1/dbname_p1/_design/viewid; HTTP_Command = PUT /dbname_p1/_design/viewid. Then proxy call M1 again with URL=http://M1/dbname_p2/_design/viewid; HTTP_Command = PUT /dbname_p2/_design/viewid.
  4. Proxy do the same to M2, M3

Query a View
  1. Client call Proxy with URL=http://proxy/dbname/_view/viewid/attrid; HTTP_Command = GET /dbname/_view/viewid/attrid
  2. Proxy will determine all the partitions of "dbname", and for each partition, it randomly get a copy of that partition (e.g. v1-3, v2-2). It also identifies the physical node (e.g. M1, M3) that host these virtual nodes.
  3. The proxy call M1 with URL=http://M1/dbname_p1/_view/viewid/attrid; HTTP_Command = GET /dbname_p1/_view/viewid/attrid
  4. The proxy do the same to M3
  5. The proxy combines the result from M1, M3. If the "attrid" is a map only function, the proxy will just concatenate all the results together. But if the "attrid" has a reduce function defined, then the proxy will invoke the view engine's reduce() function with rereduce = true. Then the proxy return the combined result to the client.

Replication within the Cluster
  1. Periodically, Proxy will replicate the changes of ConfigurationDB among themselves. This will ensure all the proxies are having the same picture of the topology.
  2. Periodically, Proxy will pick a DB, pick one of its partition, and replicate the changes from the primary to all the secondaries. This will make sure all the copies of each partition of DB are in sync.

Client data sync

Lets say the client also has a local DB, which is replicated from the cluster. This is important for occasionally connected scenario, where the client may disconnect with the cluster for a time period and work with the local DB for a while. Later on when the client connects back to the cluster again, the data between the local DB and the cluster need to be synchronized.

To replicate changes from the local DB to the cluster ...
  1. Client start a replicator, and send the POST /_replicate with {source : "http://localhost/localdb, target: "http://proxy/dbname"}
  2. The replicator, which has remembered the last seq_num of the source in the previous replication, and extract all the changes of the localDB since then.
  3. The replicator push these changes to the proxy.
  4. The proxy will examine the list of changes. For each changed document, it will call "select_partition(docid)" to determine the partition, and then lookup the primary copy of that partition and then the physical node that host this virtual node.
  5. The proxy will push this changed document to the physical node. In other words, the primary copy of the cluster will first receive the changes from the localDB. These changes will be replicated to the secondary copies at a latter time.
  6. When complete, the replicator will update the seq_num for the next replication.
To replicate the changes from the cluster to the localDB
  1. Client starts the replicator, which has remembered the last "seq_num" array of the cluster. The seq_num array contains all the seq_num of each virtual node of the cluster. This seq_num array is a opaque data structure which the replicator doesn't care.
  2. The replicator send a request to the proxy to extract the latest changs, along with the seq_num array
  3. The proxy first lookup who is the primary of each partition, and then it extract changes from them using the appropriate seq_num from the seq_num array.
  4. The proxy consolidate all changes from each primary copy of each partition, and send them back to the replicator, along with the updated array of seq_num.
  5. The replicator apply these changes to the localDB, and then update the seq_num array for the next replication.
References
Published at DZone with permission of Ricky Ho, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Shoaib Almas replied on Sat, 2012/08/25 - 5:58am

If create/update document results in writing the doc to only the primary virtual node in its partition, a read which would follow may be directed to a secondary virtual node and produce a stale version. The periodic sync-up does not guarantee me anything, it's just a best effort attempt to synchronize.

Java Forum

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.