MongoDB incremental mapReduce, select only new documents, added after last mapReduce

Let's say I have a collection with documents that look like this (a simplified example, but it shows the scheme):

> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }

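Now I need to collect some statistics on that collection. For example, a mapReduce over 'data' that emits each document's value keyed by its type, with this reduce function body:

        var total = 0;
        for (var i in values) { total += values[i]; }
        return total;

will collect the totals per type in the 'stat' collection.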

> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
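
For readers who want to check the numbers without a MongoDB server, here is a minimal pure-Python sketch of the same computation. It assumes the map step emits (type, value) for each document, which is inferred from the 'stat' output above; the names `map_doc`, `reduce_values` and `map_reduce` are hypothetical helpers, not MongoDB APIs.

```python
from collections import defaultdict

# Sample documents copied from the 'data' collection above (without _id).
DATA = [
    {"type": "A", "value": 11},
    {"type": "A", "value": 58},
    {"type": "B", "value": 37},
    {"type": "B", "value": 1},
    {"type": "A", "value": 85},
    {"type": "B", "value": 12},
]


def map_doc(doc):
    """Map step (assumed): emit one (key, value) pair per document."""
    yield doc["type"], doc["value"]


def reduce_values(key, values):
    """Reduce step: sum all values emitted for one key."""
    return sum(values)


def map_reduce(docs):
    """Group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            groups[key].append(value)
    return {key: reduce_values(key, values) for key, values in groups.items()}


stat = map_reduce(DATA)
print(stat)  # {'A': 154, 'B': 50}
```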

At this point everything is perfect, but I'm stuck on the next step:

  1. The 'data' collection is constantly updated with new data (old documents stay unchanged; only inserts, no updates).
  2. I would like to update the 'stat' collection periodically, but I do not want to query the whole 'data' collection every time, so I chose to run an incremental mapReduce.
  3. It may seem good enough to just update the 'stat' collection on every insert into 'data' and not use mapReduce at all, but the real case is more complex than this example and I would like to get statistics only on demand.
  4. To do this I need to be able to query only the documents that were added after my last mapReduce.
  5. As far as I understand, I cannot rely on the ObjectId, i.e. just store the last one and later select every document with ObjectId > stored, because ObjectIds are not like autoincrement ids in SQL databases (for example, different shards will produce different ObjectIds).
  6. I could change the ObjectId generator, but I am not sure how best to do that in a sharded environment.

So the question is:

Is there any way to select only the documents added after the last mapReduce in order to run an incremental mapReduce, or is there perhaps another strategy for updating statistics on a constantly growing collection?

Hitosu asked Oct 17 '11 13:10


2 Answers

You can cache the time of the last run and use it as a barrier for your next incremental map-reduce.

We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)

We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):

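from datetime import timedelta

from bson.code import Code

# Collection.map_reduce is available in PyMongo 3.x and earlier.
m = Code("<map function>")
r = Code("<reduce function>")

# last_time is the cached end of the previous run (a datetime)
end = last_time + timedelta(minutes=5)

# Use time and optionally any other keys you need here
q = {"date": {"$gte": last_time, "$lt": end}}

# "reduce" combines the new results with what is already in the output collection
collection.map_reduce(m, r, out={"reduce": "<output_collection>"}, query=q)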

Note that we used the "reduce" output mode and not "merge", because we don't want to overwrite what we had before; we want to combine the old results and the new results with the same reduce function.
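
To illustrate why the "reduce" output mode gives the right totals, here is a small pure-Python sketch that imitates it without a server. It assumes each document carries a "date" field, as in the query above; `run_window` and `reduce_values` are hypothetical helpers, and the plain dict `stat` stands in for the output collection. The approach relies on the reduce function being associative, which holds for a sum.

```python
from datetime import datetime, timedelta


def reduce_values(key, values):
    """Same reduce function for fresh batches and for merging with old results."""
    return sum(values)


def run_window(docs, stat, start, end):
    """Process only documents with start <= date < end and fold them into stat."""
    batch = {}
    for doc in docs:
        if start <= doc["date"] < end:
            batch.setdefault(doc["type"], []).append(doc["value"])
    for key, values in batch.items():
        new_value = reduce_values(key, values)
        if key in stat:
            # "reduce" mode: combine the stored result with the new one.
            stat[key] = reduce_values(key, [stat[key], new_value])
        else:
            stat[key] = new_value
    return end  # cache this as last_time for the next run


t0 = datetime(2011, 10, 17, 12, 0)
step = timedelta(minutes=5)
docs = [
    {"type": "A", "value": 11, "date": t0 + timedelta(minutes=1)},
    {"type": "A", "value": 58, "date": t0 + timedelta(minutes=2)},
    {"type": "B", "value": 37, "date": t0 + timedelta(minutes=3)},
    {"type": "B", "value": 1, "date": t0 + timedelta(minutes=6)},
    {"type": "A", "value": 85, "date": t0 + timedelta(minutes=7)},
    {"type": "B", "value": 12, "date": t0 + timedelta(minutes=8)},
]

stat = {}
last_time = run_window(docs, stat, t0, t0 + step)
after_first = dict(stat)  # {'A': 69, 'B': 37}
last_time = run_window(docs, stat, last_time, last_time + step)
print(after_first, stat)  # second run yields {'A': 154, 'B': 50}
```

With "merge" semantics the second run would have replaced the stored values with the totals of the second window only, which is the overwrite described above.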

Xavier Ho answered Nov 16 '22 02:11

You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.

EDIT: Sorry, that was the Java docs. The expression _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/, is for the Ruby driver; in the mongo shell the equivalent is _id.getTimestamp().
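
As a driver-independent illustration, the time portion can also be read straight from the hex form of the ID: the first 4 bytes of an ObjectId are the creation time in seconds since the Unix epoch. The sketch below uses only the Python standard library; `objectid_timestamp` is a hypothetical helper name (PyMongo's `ObjectId.generation_time` exposes the same value). Note that the resolution is one second and the clock is that of whichever machine generated the ID, so the ordering across machines is only approximate.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(oid_hex[:8], 16)  # first 4 bytes: seconds since the epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# IDs taken from the question's sample documents.
first = objectid_timestamp("4e9c1f27aa3dd60ee98282cf")
last = objectid_timestamp("4e9c1f5daa3dd60ee98282d4")
print(first.isoformat(), last.isoformat())
```

A cached timestamp of this kind could then serve as the lower bound for the next incremental run, with the caveats about clock differences noted above.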

Chris Shain answered Nov 16 '22 03:11
