Ayende Rahien works for Hibernating Rhinos LTD, an Israeli-based company producing developer productivity tools for OLTP applications, such as NHibernate Profiler (nhprof.com), Linq to SQL Profiler (l2sprof.com), Entity Framework Profiler (efprof.com) and more. Ayende is a DZone MVB and is not an employee of DZone; he has posted 457 posts at DZone. You can read more from him at his website.

Reviewing LevelDB, Part II

03.22.2013

I think that the very first thing we want to do is to discover how exactly leveldb saves information to disk. In order to do that, we are going to trace the calls (with commentary) for the Put method.

We start from the client code:

    leveldb::DB* db;
    leveldb::DB::Open(options, "play/testdb", &db);
    status = db->Put(leveldb::WriteOptions(), "Key", "Hello World");

This calls the following method:

    // Default implementations of convenience methods that subclasses of DB
    // can call if they wish
    Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) {
      WriteBatch batch;
      batch.Put(key, value);
      return Write(opt, &batch);
    }

    Status DB::Delete(const WriteOptions& opt, const Slice& key) {
      WriteBatch batch;
      batch.Delete(key);
      return Write(opt, &batch);
    }

I included the Delete method as well, because this code teaches us something important: all modification calls always go through the same WriteBatch / Write() path. Let us look at that now.
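To get a feel for why funneling everything through a batch matters, here is a minimal toy sketch of the idea (my own illustration, not leveldb's actual implementation — the real WriteBatch serializes operations into a byte buffer rather than keeping a vector): a batch records puts and deletes in order, and then applies them all in one shot.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for leveldb's WriteBatch: records operations in order,
// then replays them against a "memtable" in a single application.
// (Assumption: this is a sketch; leveldb's real batch encodes the ops
// into a contiguous buffer that can be written to the log directly.)
struct ToyWriteBatch {
  struct Op { bool is_delete; std::string key; std::string value; };
  std::vector<Op> ops;

  void Put(const std::string& key, const std::string& value) {
    ops.push_back({false, key, value});
  }
  void Delete(const std::string& key) {
    ops.push_back({true, key, ""});
  }
  // Replay every recorded operation, in order, against the map.
  void ApplyTo(std::map<std::string, std::string>& mem) const {
    for (const Op& op : ops) {
      if (op.is_delete) mem.erase(op.key);
      else mem[op.key] = op.value;
    }
  }
};

std::map<std::string, std::string> demo_batch() {
  std::map<std::string, std::string> mem;
  ToyWriteBatch batch;
  batch.Put("Key", "Hello World");
  batch.Put("Other", "x");
  batch.Delete("Other");  // a later delete wins over the earlier put
  batch.ApplyTo(mem);
  return mem;
}
```

Because a single Put or Delete is just a batch of size one, the engine only ever has to reason about one kind of modification: "apply this batch".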

    Status DBImpl::Write(const WriteOptions& options, WriteBatch* my_batch) {
      Writer w(&mutex_);
      w.batch = my_batch;
      w.sync = options.sync;
      w.done = false;

      MutexLock l(&mutex_);
      writers_.push_back(&w);
      while (!w.done && &w != writers_.front()) {
        w.cv.Wait();
      }
      if (w.done) {
        return w.status;
      }

      // May temporarily unlock and wait.
      Status status = MakeRoomForWrite(my_batch == NULL);
      uint64_t last_sequence = versions_->LastSequence();
      Writer* last_writer = &w;
      if (status.ok() && my_batch != NULL) {  // NULL batch is for compactions
        WriteBatch* updates = BuildBatchGroup(&last_writer);
        WriteBatchInternal::SetSequence(updates, last_sequence + 1);
        last_sequence += WriteBatchInternal::Count(updates);

        // Add to log and apply to memtable.  We can release the lock
        // during this phase since &w is currently responsible for logging
        // and protects against concurrent loggers and concurrent writes
        // into mem_.
        {
          mutex_.Unlock();
          status = log_->AddRecord(WriteBatchInternal::Contents(updates));
          if (status.ok() && options.sync) {
            status = logfile_->Sync();
          }
          if (status.ok()) {
            status = WriteBatchInternal::InsertInto(updates, mem_);
          }
          mutex_.Lock();
        }
        if (updates == tmp_batch_) tmp_batch_->Clear();

        versions_->SetLastSequence(last_sequence);
      }

      while (true) {
        Writer* ready = writers_.front();
        writers_.pop_front();
        if (ready != &w) {
          ready->status = status;
          ready->done = true;
          ready->cv.Signal();
        }
        if (ready == last_writer) break;
      }

      // Notify new head of write queue
      if (!writers_.empty()) {
        writers_.front()->cv.Signal();
      }
      return status;
    }

Now we have a lot of code to go through. Let us see what conclusions we can draw from this.

The first 15 lines or so seem to create a new Writer (not sure what that is yet) and register it in a member variable. Maybe the write actually happens on a separate thread?

I am going to follow that line of thinking. The first thing to do is to look at the Writer implementation, which looks like this:

        struct DBImpl::Writer {
          Status status;
          WriteBatch* batch;
          bool sync;
          bool done;
          port::CondVar cv;

          explicit Writer(port::Mutex* mu) : cv(mu) { }
        };

So this is just a data structure with no behavior. Note the CondVar member, which accepts a mutex. Following the code, we see that this is a pthread condition variable. I haven't dug too deeply into this, but it appears to be similar to .NET's monitor locking, except that you can associate multiple condition variables with a single mutex. That can be a useful way to signal on specific conditions: you can wait for a specific event to happen, not just for a single lock to be released.
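The "multiple condition variables over one mutex" idea can be sketched like this (using std::condition_variable for portability; leveldb's port::CondVar wraps the platform primitive, e.g. pthreads, but the semantics are the same):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Two condition variables sharing one mutex, so each waiter blocks on
// the specific event it cares about, rather than on the lock itself.
struct TwoPhase {
  std::mutex mu;
  std::condition_variable ready_cv;  // signaled when work is ready
  std::condition_variable done_cv;   // signaled when work is finished
  bool ready = false;
  bool done = false;
};

int run_two_phase_demo() {
  TwoPhase s;
  int result = 0;

  std::thread worker([&] {
    std::unique_lock<std::mutex> l(s.mu);
    s.ready_cv.wait(l, [&] { return s.ready; });  // waits only for "ready"
    result = 42;                                  // do the "work"
    s.done = true;
    s.done_cv.notify_one();
  });

  {
    std::unique_lock<std::mutex> l(s.mu);
    s.ready = true;
    s.ready_cv.notify_one();
    s.done_cv.wait(l, [&] { return s.done; });    // waits only for "done"
  }
  worker.join();
  return result;
}
```

Both sides share the same mutex for their state, yet neither can be woken by the wrong event, which is exactly what each Writer's per-instance cv gives leveldb.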

Now that I get that, let us see what we can figure out about the writers_ usage. This is just a standard (non thread safe) std::deque, a data structure merging the properties of a list and a queue. Thread safety is achieved via the call to MutexLock on line 7. I am going to ignore the rest of the function for now and look at where else this value is being used. Back now, and it appears that the only places where writers_ is used are in this method and the methods it calls.

What this means, in turn, is that unlike what I thought, there isn't a dedicated background thread for this operation. Rather, this is a way for leveldb to serialize access. As I understand it, calls to the Write() method block on the mutex, then wait until their write is the current one (that is what the &w != writers_.front() check means). The code also seems to suggest that another writer may pick up pending writes and batch multiple writes to disk at the same time. We will discuss this later on.
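The serialization pattern, as I read it, can be reduced to the following toy version (my own simplification: one shared condition variable instead of a per-writer cv, the "log" is just a vector, and the write happens under the lock, whereas leveldb releases the mutex around the actual disk write):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Sketch of the pattern: every caller enqueues itself, and only the
// writer at the front of the queue performs the write. No background
// thread is involved; the calling threads take turns.
class ToyDB {
 public:
  void Write(const std::string& record) {
    Waiter w;
    std::unique_lock<std::mutex> l(mu_);
    writers_.push_back(&w);
    // Block until we are at the front (cf. &w != writers_.front()).
    cv_.wait(l, [&] { return writers_.front() == &w; });
    log_.push_back(record);  // the "write"; leveldb unlocks around this
    writers_.pop_front();
    cv_.notify_all();        // wake the new head of the queue
  }
  std::vector<std::string> log() {
    std::unique_lock<std::mutex> l(mu_);
    return log_;
  }
 private:
  struct Waiter {};
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Waiter*> writers_;
  std::vector<std::string> log_;
};

size_t run_writers_demo() {
  ToyDB db;
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i)
    threads.emplace_back([&db, i] { db.Write("record-" + std::to_string(i)); });
  for (auto& t : threads) t.join();
  return db.log().size();  // every write made it, one at a time
}
```

Even with eight concurrent callers, the log is appended to strictly one writer at a time, which is the serialization guarantee the real code is after.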

Right now, let us move to line 17, and MakeRoomForWrite. This appears to make sure that we have enough room for the next write. I don't really follow the code there yet, so I'll ignore it for now and move on to the rest of the Write() method.

In line 18, we get the current sequence number, although I am not sure why just yet; I think it is possible this is for the log. The next interesting bit is BuildBatchGroup: this method merges the pending writes into one big write (but not too big a write). This is a really nice way to merge a lot of I/O into a single disk access, without introducing latency in the common case.
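The group-building idea can be sketched as follows. This is my own simplified take, not leveldb's code: I count "ops" and use an arbitrary cap, while the real BuildBatchGroup works on encoded batch sizes in bytes and computes its limit dynamically so small writes don't get stuck behind huge groups.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

struct PendingBatch { std::vector<std::string> ops; };

// Fold pending batches from the front of the queue into one combined
// batch, but stop before a size cap is exceeded. The front writer's own
// batch is always included. (Assumes a non-empty queue; the cap policy
// here is illustrative only.)
PendingBatch BuildGroup(std::deque<PendingBatch*>& queue, size_t max_ops) {
  PendingBatch group;
  group.ops = queue.front()->ops;
  queue.pop_front();
  while (!queue.empty() &&
         group.ops.size() + queue.front()->ops.size() <= max_ops) {
    group.ops.insert(group.ops.end(),
                     queue.front()->ops.begin(), queue.front()->ops.end());
    queue.pop_front();
  }
  return group;  // one disk write now covers several callers' batches
}

size_t run_group_demo() {
  PendingBatch a{{"put k1"}}, b{{"put k2", "del k1"}}, c{{"put k3"}};
  std::deque<PendingBatch*> queue{&a, &b, &c};
  PendingBatch group = BuildGroup(queue, 3);  // cap of 3 ops
  // a (1 op) + b (2 ops) fit; c would exceed the cap and stays queued.
  return group.ops.size();
}
```

The writer left in the queue is not lost: it becomes the head of the queue and will be signaled to do (or join) the next group, which is why the cap trades a little batching for bounded latency.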

The rest of the code deals with the actual write to the log / memtable (lines 20 – 45), then updates the status of the other writers whose batches we merged in, and finally signals the next writer in the queue, whose batch may not have made it into the current group.

And I think that this is enough for now. We haven't gotten to disk yet, I admit, but we did get a lot done. In my next post, I'll dig even deeper and try to see how the data is actually structured. I think that this would be interesting…



Published at DZone with permission of Ayende Rahien, author and DZone MVB. (source)
