NoSQL Zone is brought to you in partnership with:

I am a developer and project manager of CUBRID open source database. I often interact with CUBRID user community answering to their questions and helping them resolve CUBRID related issues. Beside the CUBRID project, I contribute to various open source projects like Hibernate, CodeIgniter, Yii, etc. Personally I am very interested in big data, scalability, and single page real-time Web apps. I enjoy reading (a lot!) and writing tech articles. I have written over one hundred blogs at http://www.cubrid.org/blog and even more tutorials. Esen is a DZone MVB and is not an employee of DZone and has posted 43 posts at DZone. You can read more from them at their website. View Full User Profile

The Availability and Operational Stability of NoSQL

01.02.2013
| 5103 views |
  • submit to reddit

Curator's Note: The content of this article was originally written by Hye Jeong Lee over at the official Cubrid blog.

What are the advantages of NoSQL compared to RDBMS? NoSQL products offer a number of advantages that are not offered by RDBMS products, including high performance, scalability, and availability. However, there is no product that is perfect in all aspects. When you closely examine NoSQL products, you can find some weak points as well as some outstanding benefits. For this reason, it is critical to use verified NoSQL products. In this article, I will analyze the distribution and availability of these products from the operational aspect. The selected targets are Cassandra, HBase and MongoDB.

We have already covered these three solutions in What is NoSQL for? and NoSQL Benchmarking. You can refer to these articles for introduction and performance comparison.

Fail-over and Data Consistency of Cassandra

Cassandra shows excellent performance in data distribution and availability. First, I will examine its distribution capability. Cassandra distributes data by using the consistent hashing.

 Cassandra's Consistent Hashing

By using consistent hashing, the client can search and find the node where the key is saved without querying the metadata. The client can find the key by calculating the hash value of the key, and find the node with only that hash value. One can think of consistent hashing as a series of hash values sequentially placed in a ring shape, with each node processing a section of the ring. If a node is added to the ring, the resource of a specific node (with a large amount of data) is split and assigned to the new one. If a node is removed, the resource assigned to that node is assigned to a neighboring node. In this way, Cassandra minimizes the number of nodes affected by adding/removing nodes.

Cassandra runs without a master server. In other words, there is no specific server that manages data distribution or failover. That means Cassandra has no Single Point Of Failure (SPoF). Instead of the master server, each node periodically shares metadata with others. This is called the gossip protocol. With the gossip protocol, a node can check whether another node is alive or dead.

Cassandra improves its availability by providing consistency levels. When this level is low, there may be no service downtime even when a node is down. For example, when one of the three nodes storing the replicated data (key) is down, a generic write request will not immediately return success because the replicate data cannot be written to the troubled node at the time of the request. However, when the consistency level is set as the number of quorum or 1, and the number of alive nodes is as many as the set value, success is returned immediately. For this reason, a request error will occur only when all the three nodes are down simultaneously.

But then, is it really true that data read/write is not affected by failed nodes?

To prove this, I reproduced a node failure under a constant stream of service requests while adding new nodes. The results were as follows:

Removing a Node and Adding a New Node

The following is the result of removing an existing node and adding a new node.

  1. When explicitly removing a node in the management tool, the data stored in the node is migrated to the remaining nodes and then the node is removed.
  2. When a new node is added, which is called bootstrapping, the added node communicates with the seed nodes to report that it has been added. Based on the configurations, the new node will bootstrap either to the range on the ring specified in the configurations, or to the range near the node with the most disk space used, that does not have another node bootstrapping into that range.
  3. The data is migrated from that node to the new node.
  4. The new node becomes available once the migration process is complete.

Adding a New Node after a Node Failure

The following is the result of adding a new node after a node failure.

  1. When a node is down, the data stored in that node is not migrated to other nodes, and the service is continued with two replicas. In other words, no error will be returned, even when service requests are received during this time.
  2. When a new node is added, the added node is assigned to a specific area of the ring. However, bootstrapping is not performed. Bootstrapping can be performed only when the number of data replications to be migrated is three.
  3. The added node has no data in it, but it handles the request because it can provide service. If a read request is received at this time, the node returns no data for the key. If the replica factor is 3 and the read consistency level is 1, 1/3 of the read requests may return no data. And if the consistency level is set to the value of quorum, 1/6 of the read requests may return empty data. In short, no read consistency is assured until the fail-over has been recovered. At the actual level 1, the coordinating node is most likely to receive the response from the new node first. This is the case because there is no I/O from the new node - it has no data. For this reason, a new node has a higher chance of returning empty data than the existing nodes.
  4. When Read Repair-ing the new node by using the management tool, the node is built with replication data read from other nodes. The read consistency is broken until the Read Repair is complete.

Cassandra can provide service without error, even when a node fails. Although Cassandra shows good performance when writing data, this is not so when reading data, because prolonged Rread Repair means prolonged data inconsistency. Therefore, to maintain read consistency during node failure, the following method should be applied.

  • Set the read consistency level to 'all' and execute a read. In this case, the latest data from all replicas can be obtained.
  • If a read is failed, Cassandra retries that read. This is because a Rread Repair at the first read may be used as the source of the restored data at the second read. However, this method assumes that Rread Repair is completed before the second read. (When the consistency level is low, the read repair is performed in the background, in a thread separate from the read operation processing thread.)

Failure Factors and Recovery Methods of HBase

HBase consists of several components, which are shown below (figure from HBase: The Definitive Guide):

 HBase Components

HRegionServer takes care of data distribution, while HMaster monitors HRegionServer. HDFS stores and replicates data, and Zookeeper keeps the location information of HMaster and elects a master. If redundancy is not established for each component, all of the components become SPoF.

HRegionServer can be detailed as follows: HRegionServer distributes data in a unit called a 'region.' A region is the result of dividing a big table where the sorted data is stored by the sorting key range (like a tablet in a big table). The key range information of each region is stored in a separate region, called the meta region. The region where the location of the meta region is stored is called the root region. In short, the region server stores a hierarchical tree consisting of root regions, meta regions, and data regions. If a region server is down, the region that the failed server covers is unavailable until that region is assigned to another server. Therefore, service downtime occurs until the region is recovered.

If this is the case, then how long will this downtime be?

Let's estimate the downtime while examining the failover process.

Region Server Failure

When a failure occurs in a region server, the data is recovered through the steps described below:

  1. HMaster detects failure and directs one of the other servers to perform the service of the failed server.
  2. The directed HRegionServer first reads the WAL (Write Ahead Log) of the new region, and recovers the MemStore of that region.
  3. Once MemStore is completely recovered, HMaster modifies the meta region that stores the location of the region to restart the service of the region.
  4. The data of the region stored in the disk will be freshly recovered by HDFS.

In the end, the recovery time required is as long as the time it takes to detect a failure, read the log and create a new region. Since the server assigned to the recovered region can access the data file in the HDFS, no data migration occurs on the HDFS. Therefore, the downtime is not significantly long.

HDFS Failure

The HDFS consists of one name node and several data nodes. Here, the name node is the node that stores meta data. So when this is down, service failure will occur. However, if one of the data nodes is down, no service failure will occur because the data has the replica. But the data stored in the failed data node will be built by one of the other nodes to recover the replica factor to normal (recovery). At this time, a huge data replication may occur, slowing down any read requests from the service or application. This is because the disk I/O for read is affected by the data replication.

Replication and Failover of MongoDB

MongoDB asynchronously replicates data from the master to the slave. The advantages of asynchronous replication are that it does not degrade performance of the master and the service performance does not degrade when a slave is added. However, the data will be lost when a failure occurs, because the data is inconsistent.

MongoDB recovers from failure in a similar way as the HA of DBMS. It elects a master when a failure occurs. Let's take a look at the two scenarios below.

Node Failure

Configure three nodes with one master and two slave nodes. Stop the master node. One of the two slaves will be automatically elected as a master. The time it takes to elect a new master when a failure occurs is a couple of seconds. This downtime is not that long. However, once the nodes are configured as a master and slaves and then the master is down again, no master is elected again.

Adding a Node

Enter data in the master. Assume that the size of the data is 5 GB, which is smaller than the memory size. Then, add a new slave to the master. In this case, adding a new slave does not degrade the performance of the master. It takes several minutes for the added slave to replicate all data.
In MongoDB, the degradation of performance due to a failed or added node is minimal. However, if a failure occurs in the nodes while the replicas of the master and the slave are inconsistent, the data that has not been replicated by the slave may be lost. In MongoDB, the master writes the operation history to the Oplog log in the local server and then the slave reads the log and stores it in its database to replicate. If a failure occurs while the slave has not yet finished reading the log of the master, the unread data will be lost. In addition, if the master log is full while the slave has not finished replicating the content, all the data in the master is read and stored in the slave, rather than being replicated in the log. This is called data sync. If a failure occurs in the master in the situation above, a large amount of data will be lost.

Conclusion

So far, I have reviewed failovers of Cassandra, HBase and MongoDB.

Cassandra offers high availability for Write operations. However, it takes a long time to recover data from a failure. This is because Cassandra identifies all the data to recover, and then reads and writes the latest version of each data. Also since it responds to service requests while the added node is still in the process of data recovery, an incorrect read result may be returned. In addition, Rread Repair is executed twice when reading the data to recover. Although it provides hinted handoff for operation execution failure as well as for node failure, an incorrect result may be returned if the data to be recovered is read first. Therefore, if the consistency level is not raised, it cannot be used for the services that require read processing.

Because of its configuration, HBase has many factors that may cause a failure. However, while Cassandra has to recover data during a failure, HBase does not need to recover data unless a failure occurs in the HDFS. This gives HBase a short downtime. The downtime during an HDFS failure is not that long either. The read performance may be hit while recovering data, but the data consistency is maintained. In this way, higher availability is offered if the SPoF part can become redundant.

MongoDB provides automatic failover and has a short downtime. However, its asynchronous replication method may cause data loss after a failover.

Thus, before choosing the database solution that will suit your purpose, you should consider these characteristics of each product.

For reference, CUBRID RDBMS provides synchronous High-Availability for data consistency which results in no data loss, though it lacks the performance of NoSQL solutions.
Published at DZone with permission of Esen Sagynov, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Vitalii Tymchyshyn replied on Sat, 2013/01/05 - 12:30pm

I am sorry, but Cassandra analisys is crap. 

Bootstrapping works even with one node down

If one of three nodes is empty, quorum consistency level will give you OK data each time (it asks two nodes, at least one will have data).

Best method to repair data on node is to call repair with the nodetool explicitly.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.