MongoDB: Terrible MapReduce Performance

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.

I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows.

CREATE TABLE `profile_views` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `username` varchar(20) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `username` (`username`,`day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

Typical data might look like this.

+--------+----------+------------+------+
| id     | username | day        | hits |
+--------+----------+------------+------+
| 650001 | Joe      | 2010-07-10 |    1 |
| 650002 | Jane     | 2010-07-10 |    2 |
| 650003 | Jack     | 2010-07-10 |    3 |
| 650004 | Jerry    | 2010-07-10 |    4 |
+--------+----------+------------+------+

I use this query to get the top 5 most viewed profiles since 2010-07-16.

SELECT username, SUM(hits) AS total_hits
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY total_hits DESC
LIMIT 5\G

This query completes in under a minute. Not bad!

Now, moving on to the world of MongoDB. I set up a sharded environment using three servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).

S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog

Once those were up and running, I hopped on server M, and launched mongo. I issued the following commands:

use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });

I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:

{
    "_id" : ObjectId("4cb8fc285582125055295600"),
    "username" : "Joe",
    "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
    "hits" : 16
}

Now comes the real meat and potatoes: my map and reduce functions. Back on server M, in the shell, I set up the query and execute it like this.

use profiles;
var start = new Date(2010, 7, 16);  // NB: JavaScript months are zero-indexed, so this is 16 Aug 2010, not 16 Jul
var map = function() {
    emit(this.username, this.hits);
};
var reduce = function(key, values) {
    var sum = 0;
    for (var i in values) sum += values[i];
    return sum;
};
res = db.views.mapReduce(
    map,
    reduce,
    {
        query : { day : { $gt : start } }
    }
);

And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:

{
    "result" : "tmp.mr.mapreduce_1287207199_6",
    "shardCounts" : {
        "127.20.90.7:10000" : {
            "input" : 4917653,
            "emit" : 4917653,
            "output" : 1105648
        },
        "127.20.90.1:10000" : {
            "input" : 5082347,
            "emit" : 5082347,
            "output" : 1150547
        }
    },
    "counts" : {
        "emit" : NumberLong(10000000),
        "input" : NumberLong(10000000),
        "output" : NumberLong(2256195)
    },
    "ok" : 1,
    "timeMillis" : 811207,
    "timing" : {
        "shards" : 651467,
        "final" : 159740
    }
}

Not only did it take forever to run, but the results don't even seem to be correct.

db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Jack", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }

I know those value numbers should be much higher.
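
(Aside: the output documents only have _id and value fields, so the sort({ hits: -1 }) above probably isn't sorting anything; sorting on value would look like this:)

db[res.result].find().sort({ value: -1 }).limit(5);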

My understanding of the whole MapReduce paradigm is that the work of performing this query should be split between all the shard members, which should increase performance. I waited until Mongo was done distributing the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.
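
For reference, the chunk distribution can be double-checked from the mongos shell; a quick sanity-check sketch:

db.printShardingStatus();
// lists each shard and which chunks of profiles.views live on it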

So I must be doing something wrong. Can anyone give me any pointers?

Edit: Someone on IRC mentioned adding an index on the day field, but as far as I can tell that was done automatically by MongoDB.
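
(The indexes can be listed directly to verify that; a quick sketch, expecting the default _id index, the { day : 1 } shard-key index, and the { hits : -1 } index created earlier:)

use profiles
db.views.getIndexes();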

asked Oct 16 '10 by mellowsoon



2 Answers

Excerpts from MongoDB: The Definitive Guide (O'Reilly):

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in “real time.” You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

Options for map/reduce:

"keeptemp" : boolean
    If the temporary result collection should be saved when the connection is closed.

"output" : string
    Name for the output collection. Setting this option implies keeptemp : true.
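
A rough sketch of that pattern using the map, reduce, and start variables from the question (the option name varies by server version, out vs. output, and the result collection name is just an example):

// write the results to a named, persistent collection instead of a tmp.mr.* one
res = db.views.mapReduce(map, reduce, {
    query : { day : { $gt : start } },
    out : "views_by_user"
});

// reads can then hit the precomputed collection "in real time"
db.views_by_user.find().sort({ value : -1 }).limit(5);
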
answered Sep 26 '22 by nonopolarity


Maybe I'm too late, but...

First, you are querying the collection to feed the MapReduce without an index. You should make sure there is an index on "day".

MongoDB MapReduce is single-threaded on a single server, but it parallelizes across shards. The data in MongoDB shards is kept together in contiguous chunks sorted by the shard key.

Since your shard key is "day", and you are querying on it, you are probably only using one of your shards. The shard key is only used to spread the data. MapReduce will query using the "day" index on each shard, and will be very fast.
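
One quick way to check that is to run just the query portion through mongos with explain(), which includes a per-shard breakdown in a sharded cluster (a sketch; the exact fields vary by version):

db.views.find({ day : { $gt : new Date(2010, 6, 16) } }).explain();
// if only one shard shows up in the output, the map/reduce is effectively
// running on a single server despite the cluster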

Add something in front of the day key to spread the data. The username can be a good choice.

That way the MapReduce will be launched on all shards and should hopefully cut the running time down accordingly.

Something like this:

use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {username : 1, day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
db.views.ensureIndex({ day: -1 });

I think with those additions you can match the MySQL speed, or even beat it.

Also, it's better not to use it in real time. If your data doesn't need to be up-to-the-minute precise, schedule a map-reduce task every now and then and use the result collection.
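
A minimal sketch of that approach, assuming a script run periodically (from cron, for example) and a made-up output collection name:

// refresh_views_by_user.js -- run e.g. hourly with: mongo <mongos-host>/profiles refresh_views_by_user.js
var start = new Date(2010, 6, 16);   // JS months are zero-indexed: 6 = July

db.views.mapReduce(
    function() { emit(this.username, this.hits); },
    function(key, values) {
        var sum = 0;
        for (var i = 0; i < values.length; i++) sum += values[i];
        return sum;
    },
    {
        query : { day : { $gt : start } },
        out : "views_by_user"          // the precomputed collection the application reads from
    }
);

// the live "top 5" query then becomes a cheap read:
// db.views_by_user.find().sort({ value : -1 }).limit(5);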

answered Sep 24 '22 by FrameGrace