Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB, MapReduce and sorting

I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.

Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (eg. popularity) and then group it by an id. The only way I see I can do this is through Mongo's MapReduce functionality.

I can't use .group() because I'm working with more than 10,000 keys and I also need to be able to sort the dataset.

My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.

db.runCommand({
  'mapreduce': 'products',
  'map': function() {
    emit({
      product_id: this.product_id,
      popularity: this.popularity
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
  'sort': {popularity: -1}
});

I already have a descending index on the popularity datapoint, so it's definitely not working because of a lack of that:

{ 
  "v" : 1, 
  "key" : { "popularity" : -1 }, 
  "ns" : "app.products", 
  "name" : "popularity_-1" 
}

I just cannot figure out why it doesn't want to sort.

Instead of inlining the result set, I can't output it to another collection and then run a .find().sort({popularity: -1}) on that because of the way this feature is going to work.

like image 567
Jon Ursenbach Avatar asked Aug 18 '12 01:08

Jon Ursenbach


2 Answers

First of all, Mongo map/reduce are not designed to be used in as a query tool (as it is in CouchDB), it is design for you to run background tasks. I use it at work to analyze traffic data.

What you are doing wrong however is that you're applying the sort() to your input, but it is useless because when the map() stage is done the intermediate documents are sorted by each keys. Because your key is a document, it is being sort by product_id, popularity.

This is how I generated my dataset

function generate_dummy_data() {
    for (i=2; i < 1000000; i++) { 
        db.foobar.save({
          _id: i, 
         category_id: parseInt(Math.random() * 30), 
         popularity:    parseInt(Math.random() * 50)
        }) 
    }
}

And this my map/reduce task:

var data = db.runCommand({
  'mapreduce': 'foobar',
  'map': function() {
    emit({
      sorting: this.popularity * -1,
      product_id: this._id,
      popularity: this.popularity,
    }, 1);
  },
  'reduce': function(key, values) {
    var sum = 0;
    values.forEach(function(v) {
      sum += v;
    });

    return sum;  
  },
  'query': {category_id: 20},
  'out': {inline: 1},
});

And this is the end result (very long to paste it here):

http://cesarodas.com/results.txt

This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting how ever you like just remember that the final sorting is by key regardless of you how your input is sorted.

Anyway as I said before you should avoid doing queries with Map/Reduce it was designed for background processing. If I were you I would design my data in such a way I could access it with simple queries, there is always a trade-off in this case complex insert/updates to have simple queries (that's how I see MongoDB).

like image 178
crodas Avatar answered Nov 02 '22 06:11

crodas


As noted in discussion on the original question:

  • Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include relying on the emitted key order (see @crodas's answer); outputting to a collection and querying that collection with sort order; or sorting the results in your application using something like usort().

  • OP's preference is for inline results rather than creating/deleting temporary collections.

  • The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.

Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:

db.products.aggregate(
  { $match: { category_id: 20 }},
  { $group : {
     _id : "$product_id",
     'popularity' : { $sum : "$popularity" },
  }},
  { $sort: { 'popularity': -1 }}
)

.. and sample output:

{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}
like image 30
Stennie Avatar answered Nov 02 '22 06:11

Stennie