I'm working on a complicated map-reduce process for a mongodb database. I've split some of the more complex code off into modules, which I then make available to my map/reduce/finalize functions by including it in my scopeObj
like so:
const scopeObj = {
userCalculations: require('../lib/userCalculations')
}
function myMapFn() {
let userScore = userCalculations.overallScoreForUser(this)
emit({
'Key': this.userGroup
}, {
'UserCount': 1,
'Score': userScore
})
}
function myReduceFn(key, objArr) { /*...*/ }
db.collection('userdocs').mapReduce(
myMapFn,
myReduceFn,
{
scope: scopeObj,
query: {},
out: {
merge: 'userstats'
}
},
function (err, stats) {
return cb(err, stats);
}
)
...This all works fine. I had until recently thought it wasn't possible to include module code into a map-reduce scopeObj
, but it turns out that was just because the modules I was trying to include all had dependencies on other modules. Completely standalone modules appear to work just fine.
Which brings me (finally) to my question. How can I -- or, for that matter, should I -- incorporate more complex modules, including things I've pulled from npm, into my map-reduce code? One thought I had was using Browserify or something similar to pull all my dependencies into a single file, then include it somehow... but I'm not sure what the right way to do that would be. And I'm also not sure of the extent to which I'm risking severely bloating my map-reduce code, which (for obvious reasons) has got to be efficient.
Does anyone have experience doing something like this? How did it work out, if at all? Am I going down a bad path here?
UPDATE: A clarification of what the issue is I'm trying to overcome:
In the above code, require('../lib/userCalculations')
is executed by Node -- it reads in the file ../lib/userCalculations.js
and assigns the contents of that file's module.exports
object to scopeObj.userCalculations
. But let's say there's a call to require(...)
somewhere within userCalculations.js
. That call isn't actually executed yet. So, when I try to call userCalculations.overallScoreForUser()
within the Map function, MongoDB attempts to execute the require
function. And require
isn't defined on mongo.
Browserify, for example, deals with this by compiling all the code from all the required modules into a single javascript file with no require
calls, so it can be run in the browser. But that doesn't exactly work here, because I need to be the resulting code to itself be a module that I can use like I use userCalculations
in the code sample. Maybe there's a weird way to run browserify that I'm not aware of? Or some other tool that just "flattens" a whole hierarchy of modules into a single module?
Hopefully that clarifies a bit.
In MongoDB, map-reduce operations use custom JavaScript functions to map, or associate, values to a key. If a key has multiple values mapped to it, the operation reduces the values for the key to a single object. The use of custom JavaScript functions provide flexibility to map-reduce operations.
MongoDB provides the mapReduce() function to perform the map-reduce operations. This function has two main functions, i.e., map function and reduce function. The map function is used to group all the data based on the key-value and the reduce function is used to perform operations on the mapped data.
Map-reduce is a common pattern when working with Big Data – it's a way to extract info from a huge dataset. But now, starting with version 2.2, MongoDB includes a new feature called Aggregation framework. Functionality-wise, Aggregation is equivalent to map-reduce but, on paper, it promises to be much faster.
$reduce. Applies an expression to each element in an array and combines them into a single value.
As a generic response, the answer to your question: How can I -- or, for that matter, should I -- incorporate more complex modules, including things I've pulled from npm, into my map-reduce code? - is no, you can not safely include complex modules in node code you plan to send to MongoDB for mapReduce jobs.
You mentioned the problem yourself - nested require
statements. Now, require is sync, but if you have nested functions inside, these require calls would not be executed until call time, and MongoDB VM would throw at this point.
Consider the following example of three files: data.json
, dep.js
and main.js
.
// data.json - just something we require "lazily"
false
// dep.js -- equivalent of your userCalculations
module.exports = {
isValueTrue() {
// The problem: nested require
return require('./data.json');
}
}
// main.js - from here you send your mapReduce to MongoDB.
// require dependency instantly
const calc = require('./dep.js');
// require is synchronous, the effectis the same if you do:
// const calc = (function () {return require('./dep.js')})();
console.log('Calc is loaded.');
// Let's mess with unwary devs
require('fs').writeFileSync('./data.json', 'false');
// Is calc.isValueTrue() true or false here?
console.log(calc.isValueTrue());
As a general solution, this is not feasible. While vast majority of modules will likely not have nested require
statements, HTTP calls, or even internal, service calls, global variables and similar, there are those who do. You cannot guarantee that this would work.
Now, as a your local implementation: e.g. you require exactly specific versions of NPM modules that you have tested well with this technique and you know it will work, or you published them yourself, it is somewhat feasible.
However, even if this case, if this is a team effort, there's bound to be a developer down the line who will not know where your dependency is used or how, use globals (not on purpose, but by ommission, e.g they wrongly calculate this
) or simply not know the implications of whatever they are doing. If you have strong integration testing suite, you could guard against this, but the thing is, it's unpredictable. Personally I think that when you can choose between unpredictable and predictable, almost always you should use predictable.
Now, if you have an explicitly stated purpose for a certain library to be used in MongoDB mapReduce, this would work. You would have to guard well against ommisions and problems, and have strong testing infra, but I would make certain the purpose is explicit before feeling safe enough to do this. But of course, if you're using something that is so complex that you need several npm packages to do, maybe you can have those functions directly on MongoDB server, maybe you can do your mapReducing in something better suited for the purpose, or similar.
To conclude: As a purposefuly built library with explicit mission statement that it is to be used with node and MongoDB mapReduce, I would make sure my tests cover all my mission-critical and important functionality, and then import such npm package. Otherwise I would not use nor recommend this approach.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With