Node.js: Batch insert of a large number of rows into a database

I want to process a large number of records (>400k) in a batch and insert them into a database.

I know how to iterate over an array with for() or underscore.each(), and I know how to insert a record into various (No)SQL databases asynchronously. That's not the problem - the problem is that I can't figure out a way to do both at the same time.

The specific database doesn't matter here; the principle applies to any (No)SQL database with an asynchronous interface.

I'm looking for a pattern to solve the following problem:

The loop approach:

var results = []; //imagine 100k objects here
_.each(results,function(row){
  var newObj = prepareMyData(row);

  db.InsertQuery(newObj,function(err,response) {
    if(!err) console.log('YAY, inserted successfully');
  });

});

This approach is obviously flawed: it hammers the database with insert queries without waiting for a single one to finish. With a MySQL adapter that uses a connection pool, you run out of connections pretty quickly and the script fails.

The recursion approach:

var results = []; //again, full of BIGDATA ;)
var index = 0;

var myRecursion = function()
{
  var row = results[index];
  var data = prepareMyData(row);

  db.InsertQuery(data, function(err, response)
  {
    if (!err)
    {
      console.log('YAY, inserted successfully!');
      index++; //increment for the next recursive call
      if (index < results.length) myRecursion();
    }
  });
};

myRecursion();

While this approach works pretty well for small chunks of data (it can be slow, but that's OK - the event loop can rest a while, waiting for the query to finish), it doesn't work for large arrays: too many recursions.
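One way to cap the recursion depth is to defer each step with setImmediate, so the call stack unwinds between inserts. A minimal sketch, reusing the prepareMyData and db.InsertQuery helpers from above - note that it is still strictly sequential, so it doesn't change the overall pattern I'm asking about:

var results = []; // same large array
var index = 0;

function insertNext() {
  if (index >= results.length) return console.log('all rows inserted');
  var data = prepareMyData(results[index]);
  db.InsertQuery(data, function(err, response) {
    if (err) return console.log('insert failed:', err);
    index++;
    // defer the next step so the stack unwinds between inserts,
    // even if the driver ever calls back synchronously
    setImmediate(insertNext);
  });
}

insertNext();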

I could easily write a batch insert in a procedural language like PHP, but I don't want to. I want to solve this asynchronously, in Node.js - for educational purposes.

Any suggestions?


1 Answer

I've found a solution that works for me, but I'm still interested in understanding how this technically works.

Reading the node-async docs, I found a few functions that achieve this:

async.map //iterates over an array and collects the results of each iterator call

async.each //iterates over an array in parallel, with no limit on concurrency

async.eachSeries //iterates over an array sequentially, one item at a time

async.eachLimit //iterates over an array in parallel, with at most n (limit) calls in flight at once

For instance:

var results = []; //still a huge array
// "4" means async will fire the iterator function for up to 4 rows in parallel
async.eachLimit(results, 4, function(row, cb){
  var data = prepareMyData(row);

  db.InsertQuery(data, function(err, response)
  {
    // always call the callback - otherwise eachLimit stalls as soon as a query errors
    cb(err);
  });
}, function(err)
{
    if (err) return console.log('something went wrong:', err);
    console.log("we're done!");
});
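For anyone wondering how eachLimit works under the hood: the idea is to keep at most limit calls in flight and start the next one whenever one finishes. Below is a minimal hand-rolled sketch of that idea (not the actual async implementation), reusing the prepareMyData and db.InsertQuery helpers from the question; it assumes the iterator calls back asynchronously, as a database driver does:

// keep at most `limit` calls in flight, start the next one whenever one finishes
function eachLimitSketch(items, limit, iterator, done) {
  var index = 0;       // next item to hand out
  var running = 0;     // calls currently in flight
  var finished = false;

  function launch() {
    if (finished) return;
    if (index >= items.length) {
      if (running === 0) { finished = true; done(null); }
      return;
    }
    while (running < limit && index < items.length) {
      running++;
      iterator(items[index++], function(err) {
        running--;
        if (err && !finished) { finished = true; return done(err); }
        launch();
      });
    }
  }

  launch();
}

// usage, same shape as the async.eachLimit call above:
eachLimitSketch(results, 4, function(row, cb) {
  db.InsertQuery(prepareMyData(row), function(err) { cb(err); });
}, function(err) {
  if (err) console.log('insert failed:', err);
  else console.log("we're done!");
});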