I have some 25k documents (4 GB of raw JSON) that I want to perform a few JavaScript operations on to make them more accessible to my end data consumer (R), and I would like to sort of "version control" these changes by adding a new collection for each change, but I cannot figure out how to map/reduce without the reduce. I want a one-to-one document mapping: I start out with 25,356 documents in collection_1, and I want to end up with 25,356 documents in collection_2.
I can hack it with this:
var reducer = function(key, value_array) {
    return {key: value_array[0]};
};
And then call it like:
db.flat_1.mapReduce(mapper, reducer, {keeptemp: true, out: 'flat_2'})
(My mapper only calls emit once, with a string as the first argument and the final document as the second. It's a collection of those second arguments that I really want.)
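For concreteness, such a mapper might look roughly like the sketch below. The question doesn't show the actual mapper, so the field names here are hypothetical placeholders, and it assumes _id is an ObjectId; the point is just that each document produces exactly one emit with a string key and the transformed document as the value:
var mapper = function() {
    // One emit per document: a string key, and the final transformed document as the value.
    // someField / otherField are placeholders for whatever the real transformation uses.
    emit(this._id.str, {
        finally: [this.someField],
        thisIsWhatIWanted: [this.otherField]
    });
};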
But that seems awkward and I don't know why it even works, since my emit
call arguments in my mapper are not equivalent to the return argument of my reducer
. Plus, I end up with a document like
{
    "_id": "0xWH4T3V3R",
    "value": {
        "key": {
            "finally": ["here"],
            "thisIsWhatIWanted": ["Yes!"]
        }
    }
}
which seems unnecessary.
Also, a cursor that performs its own inserts isn't even a tenth as fast as mapReduce
. I don't know MongoDB well enough to benchmark it, but I would guess it's about 50x
slower. Is there a way to run through a cursor in parallel? I don't care if the documents in my collection_2
are in a different order than those in collection_1
.
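For reference, the cursor-based approach mentioned above would look something like the sketch below. Here transform() is just a placeholder for whatever JavaScript operations are being applied, not a real function; the loop runs single-threaded in the shell, which is presumably why it is so much slower:
db.collection_1.find().forEach(function(doc) {
    var transformed = transform(doc);   // transform() is a hypothetical placeholder
    db.collection_2.insert(transformed);
});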
When using map/reduce you'll always end up with
{ "value" : { <reduced data> } }
In order to remove the value
key you'll have to use a finalize
function.
Here's the simplest thing you can do to copy data from one collection to another:
map = function() { emit(this._id, this); };
reduce = function(key, values) { return values[0]; };
finalize = function(key, value) { db.collection_2.insert(value); };
Then you would run it as normal:
db.collection_1.mapReduce(map, reduce, { finalize: finalize });
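Depending on your MongoDB version, the out option may be required (it became mandatory in later releases). In that case the out collection is only a throwaway sink, since finalize already writes into collection_2; the name mr_scratch below is just an example:
db.collection_1.mapReduce(map, reduce, { finalize: finalize, out: 'mr_scratch' });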
"But that seems awkward and I don't know why it even works, since my emit call arguments in my mapper are not equivalent to the return argument of my reducer."
They are equivalent. The reduce function takes in an array of T
values and should return a single value in the same T
format. The format of T
is defined by your map function. Your reduce function simply returns the first item in the values array, which will always be of type T
. That's why it works :)
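As a generic illustration of that contract (unrelated to the question's data): if map emits values of some shape T, reduce must return a value of the same shape T, because its output can be fed back into another reduce pass:
map = function() { emit(this.category, { count: 1 }); };   // T = { count: <number> }
reduce = function(key, values) {
    var total = 0;
    values.forEach(function(v) { total += v.count; });
    return { count: total };                               // still shape T
};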
You seem to be on the right track. I did some experimenting and it seems you cannot do a db.collection.save()
from the map function, but you can do this from the reduce function. Your map function should simply construct the document format you need:
function map() {
    emit(this._id, { _id: this._id, heading: this.title, body: this.content });
}
The map function reuses the ID of the original document. This should prevent any re-reduce steps, since no values will share the same key.
The reduce function can simply return null
. But in addition, you can write the value to a separate collection.
function reduce(key, values) {
    db.result.save(values[0]);
    return null;
}
Now db.result
should contain the transformed documents, without any additional map-reduce noise you'd have in the temporary collection. I haven't actually tested this on large amounts of data, but this approach should take advantage of the parallelized execution of map-reduce functions.
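For completeness, running the above would look something like the line below, assuming the approach behaves as described. On newer MongoDB versions an out option is required, so mr_scratch is only a hypothetical throwaway collection; the documents you care about accumulate in db.result via the reduce function:
db.collection_1.mapReduce(map, reduce, { out: 'mr_scratch' });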