 

Pass large array to node child process

I have complex CPU intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.

var spawn = require('child_process').spawn;

// dataAsNumbers is a large 2D array
var child = spawn(process.execPath, ['/child_process_scripts/getStatistics', dataAsNumbers]);

child.stdout.on('data', function(data){
  console.log('from child: ', data.toString());
});

But when I do, node gives the error:

spawn E2BIG

I came across this article

So piping the data to the child process seems to be the way to go. My code is now:

var spawn = require('child_process').spawn;

console.log('creating child........................');

var options = { stdio: [null, null, null, 'pipe'] };
var args = [ '/getStatistics' ];
var child = spawn(process.execPath, args, options);

var pipe = child.stdio[3];

pipe.write(Buffer.from('awesome'));

child.stdout.on('data', function(data){
  console.log('from child: ', data.toString());
});

And then in getStatistics.js:

console.log('im inside child');

process.stdin.on('data', function(data) {
  console.log('data is ', data);
  process.exit(0);
});

However, the callback passed to process.stdin.on is never reached. How can I receive a stream in my child script?

EDIT

I had to abandon the buffer approach. Now I'm sending the array as a message:

var cp = require('child_process');
var child = cp.fork('/getStatistics.js');

child.send({ 
  dataAsNumbers: dataAsNumbers
});

But this only works when the length of dataAsNumbers is below about 20,000, otherwise it times out.

Mark asked May 18 '17 16:05


2 Answers

With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.

shm-typed-array is a very simple module that seems suited to your application. Example:

parent.js

"use strict";

const shm = require('shm-typed-array');
const fork = require('child_process').fork;

// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');

// Fill with dummy data
Array.prototype.fill.call(data, 1);

// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
    console.log(`Got answer: ${sum}`);

    // Demo only; ideally you'd re-use the same child
    child.kill();
});
child.send(data.key);

child.js

"use strict";

const shm = require('shm-typed-array');

process.on('message', key => {
    // Get access to shared memory
    const data = shm.get(key, 'Float64Array');

    // Perform processing
    const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);

    // Return processed data
    process.send(sum);
});

Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.

Of course, you can change 'Float64Array' (e.g. a double) to whatever typed array your application requires. Note that this library in particular only handles single-dimensional typed arrays; but that should only be a minor obstacle.
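Since the library only handles one-dimensional typed arrays, a 2D array like dataAsNumbers would need to be flattened into row-major order before copying it into the shared buffer. A small sketch (the helper names `flatten2D` and `at` are hypothetical, not part of any library):

```javascript
// Flatten a 2D numeric array into a flat Float64Array in row-major
// order, so it can be copied into a 1-D shared buffer.
function flatten2D(rows) {
  const height = rows.length;
  const width = height > 0 ? rows[0].length : 0;
  const flat = new Float64Array(height * width);
  for (let r = 0; r < height; r++) {
    for (let c = 0; c < width; c++) {
      flat[r * width + c] = rows[r][c];
    }
  }
  return flat;
}

// Read back element (r, c) given the row width used when flattening.
function at(flat, width, r, c) {
  return flat[r * width + c];
}

const flat = flatten2D([[1, 2], [3, 4]]);
console.log(at(flat, 2, 1, 0)); // 3
```

The child only needs the row width alongside the shared-memory key to index the data as if it were still two-dimensional.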

rvighne answered Nov 10 '22 23:11


I too was able to reproduce the delay you were experiencing, though perhaps not as severe. I used the following:

// main.js
const fork = require('child_process').fork

const child = fork('./getStats.js')

const dataAsNumbers = Array(100000).fill(0).map(() =>
  Array(100).fill(0).map(() => Math.round(Math.random() * 100)))

child.send({
  dataAsNumbers: dataAsNumbers,
})

And

// getStats.js
process.on('message', function (data) {
  console.log('data is ', data)
  process.exit(0)
})

node main.js 2.72s user 0.45s system 103% cpu 3.045 total

I'm generating 100k elements composed of 100 numbers each to mock your data. Make sure you are listening for the message event on process. But maybe your child processes do more complex work, which could be the reason for the failure; it also depends on the timeout you set on your query.


If you want to get better results, you could chunk your data into multiple pieces, send them to the child process, and reconstruct them there to form the initial array.


Another possibility would be to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js or even something like an AMQP queue, which would allow you to communicate between the two processes with a pool and a guarantee that messages are acknowledged by the sub-process. There are a few Node implementations, like amqp.node, but it would still require a bit of setup and configuration work.

Preview answered Nov 10 '22 22:11