NoSQL Zone is brought to you in partnership with:

Antoine works as a Technical Manager at MongoDB Inc, helping companies architect systems with MongoDB. Previously he worked as a kernel engineer on MongoDB core and the Java driver. He also has spent many years in the CDN industry, at Panther CDN then CDNetworks, designing and developing one of the largest and fastest Content Delivery and Application Acceleration Network. Prior work include the development of a new network protocol in Linux kernel to multiplex several interfaces, used in wireless devices. Antoine received a MS degree in Computer Science from Stevens Institute (Hoboken), and a BS in Computer Science from Epita (Paris France). Antoine is a DZone MVB and is not an employee of DZone and has posted 8 posts at DZone. You can read more from them at their website. View Full User Profile

MongoDB Indexing Tip #1: Find Your Friends' Recent Activity

10.31.2013
| 6729 views |
  • submit to reddit

Problem

A very common feature of any social site (and other types of application) is to find the latest activity of one’s friends. Another related example is to fetch the latest data for people or topics that one follows. You may think that databases would make it easy to implement such a feature, but unfortunately it is far from being easy.

Many new NoSQL stores do not really support secondary indexes, which are necessary for this feature. In the case of MongoDB it is possible to do it efficiently but it presents many pitfalls.

Fetching the Friends’ ids

The first step is to fetch the list of the user’s friends ids. The friendship relationship is typically either embedded into the user document (as a list of user ids) or in a mapping table (SQL style). In both cases it is assumed the application can quickly put together the list of friends’ ids.

The Ideal Query

The intuitive query should look like this:

db.posts.find({ userId: { $in: [12, 24, 56, …] } }).sort({ date: -1 }).limit(100)

Notes:

  • it is much better to use “$in“ rather than “$or“. The latter does not use indexes properly when combined with a sort.
  • the sort is for descending time, to obtain the most recent
  • the limit tells that we are only interested in the last 100 documents.

For such a query, the best index is a compound “{ userId: 1, date: 1}“. Note that it is not important whether it is ascending or descending on the date.

What is expected? Ideally, MongoDB will take the following steps:

  • go to each friend’s branch of the index.
  • pick the 100 most recent documents within each, which is fast thanks to the compound on “date“.
  • apply an in-memory sort on those documents (e.g. merge 1000 documents for 10 friends, which is fast) and then return the most recent 100.

The following diagram shows what should ideally happen.

image

The “Limit“ Problem

One debatable design decision was to not pass the “limit“ value into the query protocol. Instead only the “batchSize“ is communicated, which tells the server how many results to return in the next batch but omits to say that no more results are needed.

If you set a limit of 100, then the “batchSize“ gets set to 100 but the server assumes you may ask for more than 100 records, and keeps the cursor open on the server. In turn it makes it impossible for the server to reduce the amount of records sorted: it must sort in-memory *all* your friends’ documents *since the beginning of times*.

The fix is to use a negative limit: if set to -100 then it will be passed to the server as a negative “batchSize“ value. It tells the server: send me only 1 batch with a maximum of 100 documents and do not keep the cursor open. One caveat to this solution is that a batch cannot exceed 16MB, so the 100 documents must fit within 16MB (which is likely).

The Problem with MongoDB 2.0

With MongoDB 2.0 (which hopefully you’re not using anymore at this time) the behavior still won’t be the one expected. It will just continue to disregard the limit and always resort to sort all documents since day 1. As a consequence the queries will get slower and slower over time, which is a big no-no.

The following diagram shows the bad behavior.

image

Published at DZone with permission of Antoine Girbal, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)