NoSQL Zone is brought to you in partnership with:

Seth is the CTO at NuoDB. His main areas of focus are on the administration, security and resource management models, automation and the tools that drive these pieces. Seth is a DZone MVB and is not an employee of DZone and has posted 42 posts at DZone. You can read more from them at their website. View Full User Profile

Testing Network Failure Using NuoDB and Jepsen, Part 2

12.04.2013
| 2860 views |
  • submit to reddit

Greetings loyal readers! In our previous post, we talked about the Jepsen tester and about the various improvements we made to it. In this post, I'm going to walk through a Jepsen run made with the code from our github fork, explain the test setup and then go through the output explaining the behavior that Jepsen is producing in NuoDB.

Test Setup

This test, and many other runs of Jepsen, were done against 6 ubuntu machines. These are actual machines, rather than LXC containers. However, I was able to get one of my colleagues here to clone and test the fork on 6 amazon instances, and he got similar results. So, we expect that you'll be able to get similar results yourself. Here's what you're going to need:

  1. 6 Ubuntu 'boxes' (5 to run NuoDB, 1 to run the jepsen client)
  2. SSH access to all 6 boxes
  3. ssh-agent running with an appropriate identity for convenient access to all 6 machines
  4. sudo privledges on the 5 'boxes' running NuoDB (for iptables, among other things)
  5. All 6 machines need to be able to resolve each other's names (entries in /etc/hosts at least)

Next, you're going to want to install NuoDB on the 5 server boxes and get the brokers up and running. Then you'll want to kick off an SM and TE on each node. Here's the bash script I used to do this (it's also in the fork in setup/nuodb/setupJepsen.sh):

#!/bin/bash
#
#Sample bash script to setup a nuodb database running a TE and SM on each of 5 hosts
#
BROKER="n1"
TIMEOUT=30
ARCHIVE_DIR="/tmp/jepsen"
DOMAIN_PW="bird"
NUODBMANAGER_JAR="/opt/nuodb/jar/nuodbmanager.jar"
declare -a HOSTS=("n1" "n2" "n3" "n4" "n5")

for HOST in ${HOSTS[@]}
do
    java -jar $NUODBMANAGER_JAR --broker $BROKER --password $DOMAIN_PW --command "start process sm host $HOST database jepsen archive $ARCHIVE_DIR initialize yes options '--ping-timeout $TIMEOUT --commit remote:3 --verbose error,warn,flush'"
done

for HOST in ${HOSTS[@]}
do
    java -jar $NUODBMANAGER_JAR --broker $BROKER --password $DOMAIN_PW --command "start process te host $HOST database jepsen options '--ping-timeout $TIMEOUT --commit remote:3 --dba-user jepsen --dba-password jepsen --verbose error,warn,flush'"
done

You'll need to replace all the host names (n1, etc.) with the names you'll be using, and if your install path differs from the default, you'll need to change the NUODBMANAGER_JAR variable as well. The first loop sets up all the SMs. The first interesting bit of this is the commit mode --commit remote:3. What this means is that by default, transactions will acknowledge commits to clients only when responses have been received from 3 SMs (which is a majority). This means that by default, all writes against this database will be durable in the face of partitions. The second interesting bit is the ping timeout --ping-timeout $TIMEOUT. This sets the timeout in seconds of the heartbeat-based failure detection module. The details of this mechanism will be explained in detail in another blog post, but the gist of it is that after TIMEOUT seconds of no heartbeats from a peer, that peer is flagged as suspicious and eligible for removal from the database.

On the client machine, you'll need Java, maven and leiningen installed. Jepsen is a clojure application, so leiningen is definitely the way to go. Clojure runs on top of the JVM, and leiningen will fetch all the dependencies that jepsen needs. To prefetch all the jars and such, just run "lein deps" from the top-level directory of the jepsen project (the one with project.clj in it).

Running the Cotton-Pickin Thing

Once you've got the 6 machines setup and humming: log into the client machine, cd into the jepsen project directory and run "lein run nuodb -n 2000 -f partition -u <username> -p <passwd> -X insert". That will run the jepsen nuodb test. Here's a summary of the command line args:

  • -n controls the total number of writes
  • -f is the failure mode, 'partition' performs a partition during the test (and then heals it), 'noop' does nothing (useful for testing)
  • -u is the username of the user that will be ssh-ing into and running iptables on the server machines
  • -p is the password for the username specified in -u (needed for sudo)
  • -X insert makes jepsen treat an entire table as the result set rather than mutating a CLOB in a single-row table

Jepsen starts off just inserting integers into the logical set (in NuoDB, an actual table).

You should see something like this when starting up. After a little bit (a few hundred inserts or so), jepsen gets angry and will partition the database. In particular, it will use iptables rules to get n1 and n2 to silently drop all traffic from n3,4,5 (and vice-versa). This simulates a nefarious silent networking error, like a misbehaving switch. Since the OS won't detect failure, the NuoDB processes aren't going to get socket hangups or anything. What they will experience is buffers starting to fill up with unacknowledged messages. After partition, transactions executing against n1 and n2 won't be able to commit (remote:3 ensures that). Transactions executing against n3,4,5 will generate unsendable messages intended for n1 and n2 (which will pile up on n3-5), and those nodes will start pushing back against new transactions.

After 30 seconds of missed hearbeats, both partitions will start failure detection mechanisms that will decide on who to 'vote off the island'. The minority nodes (n1, n2) won't be able to form a quorum, and will nicely shut themselves down. The remaining nodes will have formed a quorum and will have excised n1 and n2 from their node sets. The unsent messages will be purged, and the pushed back transactions will start to drain through.

Timeouts

Jepsen tasks will time out. Tasks may block awaiting commit if they're in the minority partition and they're awaiting that third commit message that will never arrive. Tasks may block on the majority partition when pushback kicks in before failure detection has run its course. Additionally, tasks that were blocked on the minority partition will have to retry when the minority nodes shut down, and may timeout before they complete on the remaining nodes. In those cases, you'll see something like:

The writes above the timing out one took longer than desireable, but succeeded once the majority partition stopped pushing back.

Connection Failures

After the ping timeout elapses, the nodes in the minority partition will attempt to form a quorum to determine who's reachable. Because they can't, they'll elect to shut themselves down. Clients connected to those nodes will get a connection failure (a java IOException, to be precise). NuoDB recommended practice is for client applications that are doing their own connection management to retry in the case of connection failure. Many of the tasks that end up retrying may timeout because the timer isn't reset when the connection is retried. When the minority nodes shut themselves down, you'll see things like this in jepsen's output:

Results

Eventually, jepsen will heal the partition. By then the NuoDB nodes will have detected the partition, the minority nodes will have shut themselves down, the majority nodes will have stopped trying to contact the minority nodes, and all the backed up traffic will be draining through the surviving nodes. Once jepsen has finished attempting to execute all the tasks it set out to execute, it will summarize the results. For this run, the results are:

As you can see, no lost writes, even under partition and node failure, and just over 95% of the writes succeeded! We've run jepsen many times and this is not an unusual result. Successful write counts (out of a 2000 write run) will range from the mid 1800's to the mid 1900's. If the partition time is less than the ping timeout, NuoDB can 'ride out' the partition and all writes will succeed (however, this isn't the use case that jepsen is trying to exercise).

Next up, my colleague and aspiring cheese-maker Dan will set aside his cheddaring tools and dive into the details of the failure detection mechanism that jepsen is testing.


Published at DZone with permission of Seth Proctor, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)