I'm just starting out with MongoDB and trying some simple things. I filled up my database with a collection of documents containing an "item" property, and I wanted to count how many times each item appears in the collection.
An example document:
{ "_id" : ObjectId("50dadc38bbd7591082d920f0"), "item" : "Pons", "lines" : 37 }
So I wrote these two functions for doing MapReduce (in Python, using pymongo):
all_map = Code("function () {"
" emit(this.item, 1);"
"}")
all_reduce = Code("function (key, values) {"
" var sum = 0;"
" values.forEach(function(value){"
" sum += value;"
" });"
" return sum;"
"}")
This worked like a charm, so I began filling the collection. At around 30,000 documents the MapReduce already takes longer than a second... Because NoSQL brags about speed, I thought I must be doing something wrong!
A question here on Stack Overflow made me check out the aggregation framework of MongoDB. So I tried the $group + $sum + $sort combination and came up with this:
db.wikipedia.aggregate([
    { $group: { _id: "$item", count: { $sum: 1 } } },
    { $sort: { count: 1 } }
])
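For reference, the same pipeline run from pymongo (a sketch; aggregate takes the stages as a list and returns a cursor over the result documents):

pipeline = [
    {"$group": {"_id": "$item", "count": {"$sum": 1}}},
    {"$sort": {"count": 1}},
]
for doc in db.wikipedia.aggregate(pipeline):
    print(doc["_id"], doc["count"])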
This code works just fine and gives me the same results as the MapReduce, but it is just as slow. Am I doing something wrong? Do I really need to use other tools like Hadoop to get better performance?
I will place an answer basically summing up my comments. I cannot speak for other techs like Hadoop, since I have not yet had the pleasure of finding time to use them, but I can speak for MongoDB.
Unfortunately you are using two of the worst operators for any database: computed fields and grouping (or distinct) over a full table scan. The aggregation framework in this case must compute the field, group, and then sort the computed field in memory ( http://docs.mongodb.org/manual/reference/aggregation/#_S_sort ). This is an extremely inefficient task for MongoDB to perform, and in fact most likely for any database.
There is no easy way to do this in real time, inline with your own application. MapReduce could be a way out if you didn't need to return the results immediately, but since I am guessing you don't really want to wait for this kind of stuff, the usual method is simply to eradicate the group altogether.
You can do this with pre-aggregation. You create another collection, grouped_wikipedia, and in your application you maintain it using upserts with atomic operators like $set and $inc (to count the occurrences), so that you only ever get one row per item. This is probably the most sane method of solving this problem.
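A minimal sketch of what that upsert could look like in pymongo (assuming pymongo 3+; the grouped_wikipedia name and "count" field follow the suggestion above, and the database handle is a placeholder):

from pymongo import MongoClient

db = MongoClient().test  # placeholder database name

def record_item(item, lines):
    # Write the detail document as before.
    db.wikipedia.insert_one({"item": item, "lines": lines})
    # Maintain the pre-aggregated counter: $inc bumps the count atomically,
    # and upsert=True creates the document the first time an item is seen,
    # so there is exactly one row per item.
    db.grouped_wikipedia.update_one(
        {"_id": item},
        {"$inc": {"count": 1}},
        upsert=True,
    )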
This does, however, raise the problem of having to manage this extra collection alongside the detail collection wikipedia, but I believe that to be an unavoidable side effect of getting the right performance here. The benefits will outweigh the cost of managing the extra collection.
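With the counters maintained on write, the original question becomes a plain read of the pre-aggregated collection instead of a full scan plus group (again a sketch using the hypothetical names above):

# An index on "count" keeps the sort cheap.
db.grouped_wikipedia.create_index("count")
for doc in db.grouped_wikipedia.find().sort("count", 1):
    print(doc["_id"], doc["count"])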