I am a software architect passionate about software integration, high scalability and concurrency challenges.

MongoDB Facts: Over 80,000 Inserts/Second on Commodity Hardware

12.04.2013

While experimenting with some time series collections, I needed a large data set to check that our aggregation queries don't become a bottleneck as data volumes grow. We settled on 50 million documents, since beyond that number we would consider sharding anyway.

Each time-series event looks like this:

{
        "_id" : ObjectId("5298a5a03b3f4220588fe57c"),
        "created_on" : ISODate("2012-04-22T01:09:53Z"),
        "value" : 0.1647851116706831
}
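The two generated fields can be produced with a few lines of Python (a sketch; the `_id` is assigned automatically on insert, and `random_event` is a hypothetical helper name):

```python
import random
import time
from datetime import datetime

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

def random_event():
    # One event: a timestamp spread uniformly over 2012 and a value in [0, 1).
    seconds = time.mktime(min_date.timetuple()) + random.random() * delta
    return {
        'created_on': datetime.fromtimestamp(seconds),
        'value': random.random(),
    }
```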

Since we wanted random values, we thought of generating them with JavaScript or Python (we could have tried Java, but we wanted to write the generator as fast as possible). We didn't know which one would be faster, so we decided to test both.

Our first try was with a JavaScript file run through the MongoDB shell.

Here is what it looks like:

// date range for the randomly generated events
var minDate = new Date(2012, 0, 1, 0, 0, 0, 0);
var maxDate = new Date(2013, 0, 1, 0, 0, 0, 0);
var delta = maxDate.getTime() - minDate.getTime();
 
// arg1 (document count) and arg2 (job id) are supplied via --eval
var job_id = arg2;
var documentNumber = arg1;
var batchNumber = 5 * 1000;
 
var job_name = 'Job#' + job_id;
var start = new Date();
 
var batchDocuments = new Array();
var index = 0;
 
while(index < documentNumber) {
    var date = new Date(minDate.getTime() + Math.random() * delta);
    var value = Math.random();
    var document = {       
        created_on : date,
        value : value
    };
    batchDocuments[index % batchNumber] = document;
    // flush a full batch of 5000 documents with a single insert
    if((index + 1) % batchNumber == 0) {
        db.randomData.insert(batchDocuments);
    }
    index++;
    if(index % 100000 == 0) {  
        print(job_name + ' inserted ' + index + ' documents.');
    }
}
// flush the trailing partial batch, in case documentNumber
// is not a multiple of batchNumber
if(index % batchNumber != 0) {
    db.randomData.insert(batchDocuments.slice(0, index % batchNumber));
}
print(job_name + ' inserted ' + documentNumber + ' in ' + (new Date() - start)/1000.0 + 's');

This is how we ran it and what we got:

mongo random --eval "var arg1=50000000;arg2=1" create_random.js
Job#1 inserted 100000 documents.
Job#1 inserted 200000 documents.
Job#1 inserted 300000 documents.
...
Job#1 inserted 49900000 documents.
Job#1 inserted 50000000 in 566.294s

Well, this is beyond my wildest expectations already (88,293 inserts/second).

Now it’s Python’s turn. You will need to install pymongo to properly run it.

import sys
import time
import random
 
from datetime import datetime
 
import pymongo
 
# date range for the randomly generated events
min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()
 
job_id = '1'
 
if len(sys.argv) < 2:
    sys.exit("You must supply the documents_number argument")
elif len(sys.argv) > 2:
    job_id = sys.argv[2]
 
documents_number = int(sys.argv[1])
batch_number = 5 * 1000
 
job_name = 'Job#' + job_id
start = datetime.now()
 
# obtain a mongo connection (MongoClient acknowledges writes by default,
# matching the old safe=True behaviour)
connection = pymongo.MongoClient("mongodb://localhost")
 
# obtain a handle to the random database
db = connection.random
collection = db.randomData
 
batch_documents = [None] * batch_number
 
for index in range(documents_number):
    try:
        date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
        value = random.random()
        document = {
            'created_on': date,
            'value': value,
        }
        batch_documents[index % batch_number] = document
        # flush a full batch of 5000 documents with a single insert
        if (index + 1) % batch_number == 0:
            collection.insert_many(batch_documents)
        if (index + 1) % 100000 == 0:
            print(job_name, 'inserted', index + 1, 'documents.')
    except Exception:
        print('Unexpected error:', sys.exc_info()[0], ', for index', index)
        raise
# flush the trailing partial batch, if any
if documents_number % batch_number != 0:
    collection.insert_many(batch_documents[:documents_number % batch_number])
print(job_name, 'inserted', documents_number, 'in', (datetime.now() - start).total_seconds(), 's')

We ran it, and this is what we got this time:

python create_random.py 50000000
Job#1  inserted  100000  documents.
Job#1  inserted  200000  documents.
Job#1  inserted  300000  documents.
...
Job#1  inserted  49900000  documents.
Job#1  inserted  50000000  in  1713.501s

That's roughly 29,180 inserts/second, about one third of the JavaScript throughput.
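Both scripts share the same batching pattern: accumulate documents and flush every 5,000, with a final flush for any remainder. The pattern can be factored into a small generator (a sketch; `batched` is a hypothetical helper name):

```python
def batched(items, batch_size):
    # Yield lists of at most batch_size items; the last batch may be shorter.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each yielded batch would then go to a single `collection.insert_many(batch)` call.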
Published at DZone with permission of Vlad Mihalcea, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

A. Jesse Jiryu Davis replied on Thu, 2013/12/05 - 6:34pm

This is a great post, thanks for the info. I was curious why single-process Python is so much slower than single-process Javascript, so I profiled your Python script and found it spent most of its CPU on the "fromtimestamp" line. I replaced it with this:

date = datetime.now()

That change increased throughput by 33%.

Vlad Mihalcea replied on Sat, 2013/12/07 - 6:17am in response to: A. Jesse Jiryu Davis

Thanks for the tip. I am just an occasional Python developer, since I use it mostly as a universal bash scripting tool. I wanted to distribute the entries between a start and an end date so I could later calculate some time series. In this example I wanted to generate 50,000,000 values over a one-year period (2012-2013).
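A middle ground that keeps the uniform date distribution (which `datetime.now()` loses) while skipping the `mktime`/`fromtimestamp` round trip is plain `timedelta` arithmetic (a sketch, not profiled here):

```python
import random
from datetime import datetime, timedelta

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

# one datetime spread uniformly over [min_date, max_date)
date = min_date + timedelta(seconds=random.random() * delta)
```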

Michal Lorenc replied on Sun, 2014/09/14 - 5:25am

 The code runs almost 2x faster with PyPy http://mictadlo.tumblr.com/post/97461629023/pymongo-almost-2x-faster-with-pypy

Vlad Mihalcea replied on Sun, 2014/09/14 - 6:11am

 Great numbers. Thanks for sharing it.
