I want to process a large number of records (>400k) in a batch and insert them into a database.
I know how to iterate over an array with for() or underscore.each(), and I also know how to insert a record into various (No)SQL databases asynchronously. That's not the problem - the problem is that I can't figure out a way to do both at the same time.
Which database it is doesn't play a role here; the principle applies to any (No)SQL database with an async interface.
I'm looking for a pattern to solve the following problem:
The loop approach:
var results = []; // imagine 100k objects here
_.each(results, function(row) {
    var newObj = prepareMyData(row);
    db.InsertQuery(newObj, function(err, response) {
        if (!err) console.log('YAY, inserted successfully');
    });
});
This approach is obviously flawed: it hammers the database with insert queries without waiting for a single one to finish. With MySQL adapters that use a connection pool, you run out of connections pretty soon and the script fails.
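For reference, a minimal sketch of such a bounded pool, assuming the npm mysql package (the package name and connection details are my assumption here; any pooled adapter behaves similarly):

var mysql = require('mysql');

// the pool holds a fixed number of connections, so firing 100k
// inserts at once saturates it almost immediately
var pool = mysql.createPool({
    host: 'localhost',  // assumed connection details
    user: 'app',
    database: 'test',
    connectionLimit: 10 // at most 10 queries run concurrently
});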
The recursion approach:
var results = []; // again, full of BIGDATA ;)
var index = 0;
var myRecursion = function() {
    var row = results[index];
    var data = prepareMyData(row);
    db.InsertQuery(data, function(err, response) {
        if (!err) {
            console.log('YAY, inserted successfully!');
            index++; // increment for the next recursive call
            if (index < results.length) myRecursion();
        }
    });
};
myRecursion();
While this approach works pretty well for small chunks of data (though it can be slow, that's OK - the event loop can rest a while, waiting for the query to finish), it doesn't work for large arrays - too many recursions.
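A side note: the recursion above only actually grows the call stack if the driver happens to invoke its callback synchronously. If that's the suspect, deferring the next step with setImmediate keeps the stack flat. A minimal sketch, reusing the hypothetical prepareMyData and db.InsertQuery from above:

var results = []; // huge array again
var index = 0;

function insertNext() {
    if (index >= results.length) return;
    var data = prepareMyData(results[index]);
    db.InsertQuery(data, function(err, response) {
        if (err) return console.error(err);
        index++;
        // schedule the next insert on a fresh stack instead of
        // calling insertNext() directly from inside the callback
        setImmediate(insertNext);
    });
}
insertNext();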
I could easily write a batch insert in a procedural language like PHP, but I don't want to. I want to solve this asynchronously, in Node.js - for educational purposes.
Any suggestions?
I've found a solution that works for me, but I'm still interested in understanding how this technically works.
Reading the node-async docs, I found a few functions to achieve this:
async.map        // iterates over an array in parallel, collecting the results
async.each       // iterates over an array in parallel
async.eachSeries // iterates over an array sequentially
async.eachLimit  // iterates over an array in parallel with at most n (limit) concurrent calls
For instance:
var results = []; // still a huge array
// "4" means async will run the iterator function for up to 4 items in parallel
async.eachLimit(results, 4, function(row, cb) {
    var data = prepareMyData(row);
    db.InsertQuery(data, function(err, response) {
        // always invoke the callback, even on error -
        // otherwise eachLimit stalls and never finishes
        cb(err);
    });
}, function(err) {
    if (err) return console.error(err);
    console.log('we\'re done!');
});
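As for how this technically works under the hood: conceptually, a limiter like eachLimit just keeps a counter of in-flight tasks and starts a new one whenever a slot frees up. A rough sketch of the idea (my own illustration, not the actual async library source):

function eachLimitSketch(items, limit, iterator, done) {
    var index = 0;      // next item to hand out
    var running = 0;    // tasks currently in flight
    var finished = false;

    if (items.length === 0) return done(null);

    function launch() {
        // fill all free slots, but never exceed the limit
        while (running < limit && index < items.length) {
            running++;
            iterator(items[index++], function(err) {
                running--;
                if (finished) return;
                if (err) {
                    finished = true;
                    return done(err);
                }
                if (index >= items.length && running === 0) {
                    finished = true;
                    return done(null);
                }
                launch(); // a slot freed up, start the next task
            });
        }
    }
    launch();
}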