MongoDB Map/Reduce Array aggregation question

I have a MongoDB collection whose docs use several levels of nesting, and I would like to extract a multidimensional array compiled from a subset of their fields. I have a solution that works for me right now, but I want to better understand the concept of 'idempotency' and its consequences for the reduce function. Here is a sample document:

{
  "host_name" : "gateway",
  "service_description" : "PING",
  "last_update" : 1305777787,
  "performance_object" : [
    [ "rta", 0.105, "ms", 100, 500, 0 ],
    [ "pl", 0, "%", 20, 60, 0 ]
  ]
}

And here are the map/reduce functions:

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      time: ts, 
      value: P[1] 
    } );
  });
}
var R = function(key,values) {
  var result = { 
    time: [], 
    value: [] 
  };
  values.forEach(function(V){
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Data is returned in a useful structure, which I reformat and sort with finalize for graphing:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "time" : [
      [ 1306719302, 1306719601, 1306719903, ... ],
      [ 1306736404, 1306736703, 1306737002, ... ],
      [ 1306766401, 1306766701, 1306767001, ... ]
    ],
    "value" : [
      [ 122, 23423, 25654, ... ],
      [ 336114, 342511, 349067, ... ],
      [ 551196, 551196, 551196, ... ]
    ]
  }
}

Finally...

 [ [1306719302,122], [1306719601,23423], [1306719903,25654], ... ]

TL;DR: What is the expected behavior with the observed "chunking" of the array results?

I understand that the reduce function may be called multiple times on array(s) of emitted values, which is why there are several "chunks" of the complete arrays rather than a single array. The array chunks are typically 25-50 items, and it's easy enough to clean this up in finalize(): I concat() the arrays, interleave them as [time,value] pairs, and sort (sketched below). But what I really want to know is whether this can get more complex:
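For illustration, here is a minimal sketch of the kind of finalize() cleanup just described (the name F0 and the exact loop are my own; it assumes reduce has run at least once for the key, so time and value are parallel arrays of chunks):

var F0 = function(key, reduced) {
  var pairs = [];
  // reduced.time and reduced.value are parallel arrays of chunks, e.g.
  // time: [[t1,t2,...],[t3,t4,...]], value: [[v1,v2,...],[v3,v4,...]].
  // Walk the chunks together and interleave them as [time, value] pairs.
  for (var c = 0; c < reduced.time.length; c++) {
    var tChunk = reduced.time[c], vChunk = reduced.value[c];
    for (var i = 0; i < tChunk.length; i++) {
      pairs.push([tChunk[i], vChunk[i]]);
    }
  }
  // Sort the combined pairs by timestamp.
  return pairs.sort(function(a, b) { return a[0] - b[0]; });
};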

1) Is the chunking observed because of my code, MongoDB's implementation or the Map/Reduce algorithm itself?

2) Will there ever be deeper (recursive) nesting of array chunks in sharded configurations, or even just because of my hasty implementation? This would break the concat() approach.

3) Is there simply a better strategy for getting array results as shown above?

EDIT: Modified to emit arrays:

I took Thomas's advice and rewrote the functions to emit arrays. It really doesn't make sense to split up the values.

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      value: [ ts, P[1] ] 
    } );
  });
}
var R = function(key,values) {
  var result = {
    value: [] 
  };
  values.forEach(function(V){
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Now the output is similar to this:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "value" : [
      [ [1306736404,336114],[1306736703,342511],[1306737002,349067], ... ],
      [ [1306766401,551196],[1306766701,551196],[1306767001,551196], ... ],
      [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
    ]
  }
}

And I used this finalize function to concatenate the array chunks and sort them.

...
var F = function(key,values) {
  // Flatten the array chunks one level, then sort the [time,value]
  // pairs by timestamp. (Array.prototype.concat is the standard form;
  // a bare Array.concat is a non-standard SpiderMonkey extension.)
  return (Array.prototype.concat.apply([], values.value)).sort(function(a,b){
    if (a[0] < b[0]) return -1;
    if (a[0] > b[0]) return 1;
    return 0;
  });
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R,
  finalize: F
});

Which works nicely:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/mnt/bck"
  },
  "value" : [ [1306719302,122],[1306719601,122],[1306719903,122],, ... ]
}

I guess the only question that's still gnawing at me is whether Array.prototype.concat.apply([], values.value) can be trusted to flatten the output of reduce every time.
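To illustrate the worry: concat() flattens exactly one level, so if chunks were ever nested a level deeper, the inner arrays would survive:

[].concat([1,2], [3,4]);       // [1, 2, 3, 4]
[].concat([[1,2]], [[3,4]]);   // [[1, 2], [3, 4]] -- inner arrays survive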

LAST EDIT: Much simpler...

I have modified the document structure since the original example above; the only effect on the example is that the map function becomes trivially simple.

I'm still trying to wrap my brain around why Array.prototype.push.apply(result, V.data) works so differently from result.push(V.data)... but it works.
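As far as I can tell, the difference is that result.push(V.data) appends V.data as a single element (nesting it one level deeper), while Array.prototype.push.apply(result, V.data) passes each element of V.data to push() as a separate argument:

var a = [];
a.push([[1,'x'],[2,'y']]);
// a is [ [[1,'x'],[2,'y']] ] -- one element, nested one level deeper

var b = [];
Array.prototype.push.apply(b, [[1,'x'],[2,'y']]);
// b is [ [1,'x'], [2,'y'] ] -- each pair appended individually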

var M = function() {
  emit( { 
    host: this.host, 
    service: this.service, 
    metric: this.metric
  } , { 
    data: [ [ this.timestamp, this.data ] ] 
  } );
}
var R = function(key,values) {
  var result = [];
  values.forEach(function(V){
    // Append each [timestamp, data] pair individually (one level of flattening)
    Array.prototype.push.apply(result, V.data);
  });
  return { data: result };
}
var F = function(key,values) {
  return values.data.sort(function(a,b){
    return (a[0]<b[0]) ? -1 : (a[0]>b[0]) ? 1 : 0;
  });
}

It has the same output as shown just above the LAST EDIT heading.

Thanks, Thomas!

asked Jun 10 '11 by jcampbelly


1 Answer

  1. The "chunking" comes from your code: your reduce function's values parameter can contain either {time:<timestamp>,value:<value>} emitted from your map function, or {time:[<timestamps>],value:[<values]} returned from a previous call to your reduce function.

  2. I don't know if it will happen in practice, but it can happen in theory: if the output of one reduce call is reduced again, your reduce function will push whole chunk arrays rather than scalars, nesting the result one level deeper each time.

  3. Simply have your map function emit the same kind of objects that your reduce function returns, i.e. emit(<id>, {time: [ts], value: [P[1]]}), and change your reduce function accordingly, i.e. Array.prototype.push.apply(result.time, V.time) and similarly for result.value.

    Actually, I don't understand why you're not using an array of time/value pairs instead of a pair of arrays, i.e. emit(<id>, { pairs: [ {time: ts, value: P[1]} ] }) or emit(<id>, { pairs: [ [ts, P[1]] ] }) in the map function, and Array.prototype.push.apply(result.pairs, V.pairs) in the reduce function (see the sketch below). That way, you won't even need the finalize function, except perhaps to "unwrap" the array from the pairs property: because the reduce function cannot return an array, you have to wrap it in an object that way.
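    A minimal sketch of that pairs approach, adapted to the question's documents (the pairs field name is just an example): map emits the same shape reduce returns, so already-reduced values can safely be reduced again:

    var M = function() {
      var hn = this.host_name,
          sv = this.service_description,
          ts = this.last_update;
      this.performance_object.forEach(function(P){
        // Emit a one-element array of [time, value] pairs -- the same
        // shape the reduce function returns, so reduce can be re-applied.
        emit( { host: hn, service: sv, metric: P[0] },
              { pairs: [ [ts, P[1]] ] } );
      });
    }
    var R = function(key,values) {
      var result = { pairs: [] };
      values.forEach(function(V){
        // Append each pair individually; flattens exactly one level.
        Array.prototype.push.apply(result.pairs, V.pairs);
      });
      return result;
    }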

answered Nov 15 '22 by Thomas Broyer