
MongoDB and the Fine Art of Data Modelling

01.28.2014

Introduction

This is the third part of our MongoDB time series tutorial, and this post will emphasize the importance of data modelling. You might want to check the first part of this series, which introduces our virtual project requirements, and the second part, which covers common optimization techniques.

When you first start using MongoDB, you’ll immediately notice its schema-less data model. But schema-less doesn’t mean you can skip proper data modelling (which must satisfy your application’s business and performance requirements). As opposed to a SQL database, a NoSQL document model is focused more on querying than on data normalization. That’s why your design won’t be finished unless it addresses your data querying patterns.

The new data model

Our previous time event was modeled like this:

{
    "_id" : ObjectId("52cb898bed4bd6c24ae06a9e"),
    "created_on" : ISODate("2012-11-02T01:23:54.010Z")
    "value" : 0.19186609564349055
}

We concluded that the ObjectId was working against us, since its index size is about 1.4GB and our data aggregation logic doesn’t use it at all. The only true benefit of having it comes from the possibility of using bulk inserts.

The previous solution used a Date field for storing the event creation timestamp, which affected the aggregation grouping logic and led to the following group _id structure:

"_id" : {
    "year" : {
        "$year" : [
            "$created_on"
        ]
    },
    "dayOfYear" : {
        "$dayOfYear" : [
            "$created_on"
        ]
    },
    "hour" : {
        "$hour" : [
            "$created_on"
        ]
    },
    "minute" : {
        "$minute" : [
            "$created_on"
        ]
    },
    "second" : {
        "$second" : [
            "$created_on"
        ]
    }
}

This group _id requires some application logic for obtaining a proper JSON Date (see the sketch below). We can avoid that extra work by changing the created_on Date field to a numeric value, representing the number of milliseconds since the Unix epoch. This value can become our new document _id (which is indexed by default anyway).
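
As a reference, here is a minimal sketch of that conversion logic (the groupIdToDate helper is hypothetical, not code from this series):

// Hypothetical helper: rebuild a JavaScript Date from the old
// { year, dayOfYear, hour, minute, second } aggregation group _id.
function groupIdToDate(groupId) {
    // start at January 1st (UTC) of the given year
    var date = new Date(Date.UTC(groupId.year, 0, 1));
    // $dayOfYear is 1-based, so setting the day-of-month to it
    // lands on the right day of the year
    date.setUTCDate(groupId.dayOfYear);
    date.setUTCHours(groupId.hour, groupId.minute, groupId.second);
    return date;
}
// groupIdToDate({ year : 2012, dayOfYear : 307, hour : 1, minute : 23, second : 54 })
// yields ISODate("2012-11-02T01:23:54Z")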

This is what our new document structure will look like:

{
        "_id" : 1346895603146,
        "values" : [ 0.3992688732687384 ]
}
{
        "_id" : 1348436178673,
        "values" : [
                0.7518879524432123,
                0.0017396819312125444
        ]
}

Now, we can easily extract a timestamp reference (pointing to the current second, minute, hour or day) from a Unix timestamp.

So, if the current timestamp is 1346895603146 (Thu, 06 Sep 2012 01:40:03 146ms GMT), we can extract:

- the current second time point [Thu, 06 Sep 2012 01:40:03 GMT]: 1346895603000 = 1346895603146 - (1346895603146 % 1000)
- the current minute time point [Thu, 06 Sep 2012 01:40:00 GMT]: 1346895600000 = 1346895603146 - (1346895603146 % (60 * 1000))
- the current hour time point [Thu, 06 Sep 2012 01:00:00 GMT]: 1346893200000 = 1346895603146 - (1346895603146 % (60 * 60 * 1000))
- the current day time point [Thu, 06 Sep 2012 00:00:00 GMT]: 1346889600000 = 1346895603146 - (1346895603146 % (24 * 60 * 60 * 1000))

The algorithm is quite straightforward and we can employ it when calculating the aggregation group identifier.
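
For illustration, this truncation can be wrapped in a small helper (the function and constant names are mine, not from the article):

// truncate a millisecond timestamp to the start of its second, minute, hour or day
function truncateTimestamp(timestamp, granularityMillis) {
    return timestamp - (timestamp % granularityMillis);
}

var ONE_SECOND = 1000;
var ONE_MINUTE = 60 * ONE_SECOND;
var ONE_HOUR   = 60 * ONE_MINUTE;
var ONE_DAY    = 24 * ONE_HOUR;

truncateTimestamp(1346895603146, ONE_SECOND); // 1346895603000
truncateTimestamp(1346895603146, ONE_MINUTE); // 1346895600000
truncateTimestamp(1346895603146, ONE_HOUR);   // 1346893200000
truncateTimestamp(1346895603146, ONE_DAY);    // 1346889600000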

This new data model gives us one document per millisecond-level timestamp. Each time event appends a new value to the “values” array, so two events happening at the very same instant will share the same MongoDB document.
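
To make that concrete, here is a tiny shell illustration (not taken from the article) using the same upsert that the import script below relies on; the values are made up:

// two events arriving in the same millisecond: the second upsert simply
// pushes another value into the already existing document
db.randomData.update( { _id: 1346895603146 }, { $push: { values: 0.39 } }, true );
db.randomData.update( { _id: 1346895603146 }, { $push: { values: 0.74 } }, true );
db.randomData.findOne( { _id: 1346895603146 } );
// { "_id" : 1346895603146, "values" : [ 0.39, 0.74 ] }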

Inserting test data

All these changes require altering the import script we’ve used previously. Since batch inserts no longer apply, we’ll take a more realistic approach and use non-batched upserts, as in the following script:

var minDate = new Date(2012, 0, 1, 0, 0, 0, 0);
var maxDate = new Date(2013, 0, 1, 0, 0, 0, 0);
var delta = maxDate.getTime() - minDate.getTime();
 
// arg1 (number of events to insert) and arg2 (job number) are expected
// to be defined before this script runs (e.g. via the mongo shell's --eval)
var job_id = arg2;
var documentNumber = arg1;
 
var job_name = 'Job#' + job_id;
var start = new Date();
 
var index = 0;
 
while(index < documentNumber) {
    // random millisecond timestamp within the one-year interval
    var date = new Date(minDate.getTime() + Math.random() * delta);
    var value = Math.random();
    // upsert: create the per-millisecond document if it doesn't exist yet,
    // otherwise push the new value into its "values" array
    db.randomData.update( { _id: date.getTime() }, { $push: { values: value } }, true );
    index++;
    if(index % 100000 == 0) {
        print(job_name + ' inserted ' + index + ' documents.');
    }
}
print(job_name + ' inserted ' + documentNumber + ' in ' + (new Date() - start)/1000.0 + 's');
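
The script expects arg1 (the number of events) and arg2 (the job number) to be defined before it runs. One way to supply them, assuming the script is saved as random_upsert.js and the database is named random (as the stats output below suggests), is the mongo shell’s --eval option:

mongo random --eval "var arg1 = 50000000; var arg2 = 1;" random_upsert.js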

Now it’s time to insert the 50M time events.

Job#1 inserted 49900000 documents.
Job#1 inserted 50000000 documents.
Job#1 inserted 50000000 in 4265.45s

Inserting 50M entries is slower than the previous version, but we can still get 10k inserts per second without any write optimization. For the purpose of this test, we will assume that 10 events per millisecond is enough, considering that at such a rate we would eventually accumulate 315 billion events a year (10 events/ms × 1,000 ms/s × 3,600 s/h × 24 h/day × 365 days ≈ 315 billion).

Compacting data

Now, let’s check the new collection stats:

db.randomData.stats();
{
        "ns" : "random.randomData",
        "count" : 49709803,
        "size" : 2190722612,
        "avgObjSize" : 44.070233229449734,
        "storageSize" : 3582234624,
        "numExtents" : 24,
        "nindexes" : 1,
        "lastExtentSize" : 931495936,
        "paddingFactor" : 1.0000000000429572,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 1853270272,
        "indexSizes" : {
                "_id_" : 1853270272
        },
        "ok" : 1
}

The average document size has dropped from 64 to 44 bytes, and this time we have only one index. We can reduce the collection size even further by using the compact command.

db.randomData.runCommand("compact");
db.randomData.stats();
{
        "ns" : "random.randomData",
        "count" : 49709803,
        "size" : 2190709456,
        "avgObjSize" : 44.06996857340191,
        "storageSize" : 3267653632,
        "numExtents" : 23,
        "nindexes" : 1,
        "lastExtentSize" : 851263488,
        "paddingFactor" : 1.0000000000429572,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 1250568256,
        "indexSizes" : {
                "_id_" : 1250568256
        },
        "ok" : 1
}