
MongoDB Time Series: Introducing the Aggregation Framework

01.14.2014

In my previous posts, I talked about batch importing and the out-of-the-box MongoDB performance. Meanwhile, MongoDB was awarded DBMS of the year, so I decided to offer a more thorough analysis of its real-life usage.

Because theory is better understood in a pragmatic context, I will first present our virtual project requirements.

Introduction

Our virtual project has the following requirements:

  1. it must store valued time events represented as v=f(t)
  2. it must aggregate the minimum, maximum, average and record count by:
    • seconds in a minute
    • minutes in an hour
    • hours in a day
    • days in a year
  3. the seconds-in-a-minute aggregation is calculated in real time (so it must be really fast); see the pipeline sketch after this list
  4. all other aggregations are calculated by a batch processor (so they must be relatively fast)
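
For an early taste of the aggregation framework, the per-second statistics above map naturally onto an aggregation pipeline. The following is only a minimal sketch, run against the first data model described below; the fromDate/toDate variables bounding one minute of data are mine:

var fromDate = ISODate("2012-11-02T01:23:00Z");
var toDate = ISODate("2012-11-02T01:24:00Z");
db.randomData.aggregate([
    // keep only the events that fall inside the chosen minute
    { $match : { "created_on" : { $gte : fromDate, $lt : toDate } } },
    // group the events by second and compute the required statistics
    { $group : {
        "_id" : { $second : "$created_on" },
        "min" : { $min : "$value" },
        "max" : { $max : "$value" },
        "avg" : { $avg : "$value" },
        "count" : { $sum : 1 }
    }},
    // order the per-second buckets chronologically
    { $sort : { "_id" : 1 } }
]);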

Data model

I will offer two data modelling variants, each with its own pros and cons:

1. The first version uses the default auto-assigned MongoDB “_id”, which simplifies inserts, since we can do them in batches without worrying about timestamp clashes. If there are 10 values recorded each millisecond, then we will end up with 10 distinct documents. This post will discuss this data model option.

{
    "_id" : ObjectId("52cb898bed4bd6c24ae06a9e"),
    "created_on" : ISODate("2012-11-02T01:23:54.010Z")
    "value" : 0.19186609564349055
}
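
To make the "no timestamp clash" point concrete, here is a minimal shell sketch (the values are illustrative only): two documents sharing the same created_on timestamp coexist without conflict, because each receives its own auto-generated “_id”.

// both documents carry the same timestamp, yet each gets a distinct auto-assigned _id
db.randomData.insert([
    { "created_on" : ISODate("2012-11-02T01:23:54.010Z"), "value" : 0.19186609564349055 },
    { "created_on" : ISODate("2012-11-02T01:23:54.010Z"), "value" : 0.75188795244321230 }
]);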

2. The second version uses the number of milliseconds since the epoch as the “_id” field, and the values are stored inside a “values” array. If there are 10 values recorded each millisecond, then we will end up with one distinct document with 10 entries in the “values” array. A future post will be dedicated to this compacted data model.

{
        "_id" : 1348436178673,
        "values" : [
                0.7518879524432123,
                0.0017396819312125444
        ]
}
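
For completeness, a new value would typically be appended to such a per-millisecond document with an upsert, so the document is created on the first write and extended afterwards. A minimal sketch, where the collection name randomAggregates is my own placeholder:

// create the per-millisecond document if missing, otherwise append to its "values" array
db.randomAggregates.update(
    { "_id" : 1348436178673 },
    { $push : { "values" : 0.7518879524432123 } },
    { upsert : true }
);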

Inserting data

As in my previous post, I will use 50M documents for testing the aggregation logic. I chose this number because I am testing on my commodity PC. In the aforementioned post I managed to insert over 80,000 documents per second. This time I will take a more realistic approach and start by creating the collection and the indexes prior to inserting the data.

MongoDB shell version: 2.4.6
connecting to: random
> db.dropDatabase()
{ "dropped" : "random", "ok" : 1 }
> db.createCollection("randomData");
{ "ok" : 1 }
> db.randomData.ensureIndex({"created_on" : 1});
> db.randomData.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "random.randomData",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "created_on" : 1
                },
                "ns" : "random.randomData",
                "name" : "created_on_1"
        }
]

Now it’s time to insert the 50M documents.

mongo random --eval "var arg1=50000000;arg2=1" create_random.js
...
Job#1 inserted 49900000 documents.
Job#1 inserted 50000000 in 2852.56s

This time we managed to import 17,500 documents per second. At this rate we could store around 550 billion entries a year, which is more than enough for our use case.
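
The create_random.js script itself is not listed here. As a rough orientation only, a bulk-insert script along these lines could look like the sketch below; judging by the invocation above, arg1 is the document count and arg2 a job identifier, while the batch size, the date range and the progress-reporting interval are assumptions of mine:

var documentNumber = arg1;   // total documents to insert (50M above)
var jobId = arg2;            // job identifier used in the progress output
var batchSize = 5000;        // assumed batch size
var minDate = new Date(2012, 0, 1);
var maxDate = new Date(2013, 0, 1);
var delta = maxDate.getTime() - minDate.getTime();
var startTime = new Date();
var batch = [];
for (var i = 1; i <= documentNumber; i++) {
    batch.push({
        "created_on" : new Date(minDate.getTime() + Math.random() * delta),
        "value" : Math.random()
    });
    if (batch.length === batchSize) {
        db.randomData.insert(batch);
        batch = [];
    }
    if (i % 100000 === 0) {
        print("Job#" + jobId + " inserted " + i + " documents.");
    }
}
if (batch.length > 0) {
    db.randomData.insert(batch);
}
print("Job#" + jobId + " inserted " + documentNumber + " in "
    + (new Date().getTime() - startTime.getTime()) / 1000 + "s");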

Compacting data

First, we need to analyze our collection statistics, and for this we use the stats command:

db.randomData.stats()
{
        "ns" : "random.randomData",
        "count" : 50000000,
        "size" : 3200000096,
        "avgObjSize" : 64.00000192,
        "storageSize" : 5297451008,
        "numExtents" : 23,
        "nindexes" : 2,
        "lastExtentSize" : 1378918400,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 3497651920,
        "indexSizes" : {
                "_id_" : 1623442912,
                "created_on_1" : 1874209008
        },
        "ok" : 1
}

The current total index size is almost 3.5GB, which is almost half of my available RAM. Luckily, MongoDB comes with a compact command, which we can use to defragment our data. This takes a lot of time, especially because we have a large total index size.

db.randomData.runCommand("compact");
Compacting took 1523.085s

Let’s see how much space we saved through compacting:

db.randomData.stats()
{
        "ns" : "random.randomData",
        "count" : 50000000,
        "size" : 3200000032,
        "avgObjSize" : 64.00000064,
        "storageSize" : 4415811584,
        "numExtents" : 24,
        "nindexes" : 2,
        "lastExtentSize" : 1149206528,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 2717890448,
        "indexSizes" : {
                "_id_" : 1460021024,
                "created_on_1" : 1257869424
        },
        "ok" : 1
}

We freed almost 800MB of data and that’s going to be handy for our RAM-intensive aggregation operations.
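
As a quick sanity check, the savings can be read straight off the two stats outputs above; a back-of-the-envelope shell computation:

// difference between the pre- and post-compaction stats shown above
var indexSaved = 3497651920 - 2717890448;     // ~780 MB of index space
var storageSaved = 5297451008 - 4415811584;   // ~880 MB of storage
print("index: " + (indexSaved / (1000 * 1000)).toFixed(0) + " MB, storage: "
    + (storageSaved / (1000 * 1000)).toFixed(0) + " MB");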
