I'm trying to write a small Node application that will search through and parse a large number of files on the file system. To speed up the search, we are attempting to use a map-reduce style approach. The plan is the following simplified scenario:
The questions I have with this are: Is this doable in Node? What is the recommended way of doing it?
I've been fiddling, but have gotten no further than the following example using child processes:
initiator:
var child_process = require('child_process');

function Worker() {
  return child_process.fork("myProcess.js"); // fixed: closing quote was missing
}

// workItems and itemsPerProcess are defined elsewhere
for (var i = 0; i < require('os').cpus().length; i++) {
  var worker = new Worker(); // renamed from "process" to avoid shadowing the global process object
  worker.send(workItems.slice(i * itemsPerProcess, (i + 1) * itemsPerProcess));
}
myProcess.js
process.on('message', function(msg) {
  var valuesToReturn = [];
  // Do file reading here
  // How would I return valuesToReturn?
  process.exit(0);
}); // fixed: missing closing parenthesis
A few side notes:
Worker threads help us offload CPU-intensive tasks away from the event loop so they can be executed in parallel in a non-blocking manner. A worker thread runs a piece of code, as instructed by the parent thread, in isolation from the parent and other worker threads.
First, you won't really be running in parallel within a single Node application. A Node application runs on a single thread, and only one event at a time is processed by Node's event loop. Even when running on a multi-core box, you won't get parallelism of processing within a single Node application.
Node.js is a server-side, single-threaded runtime environment for JavaScript. That said, we still want to do things asynchronously and in parallel. Node itself uses several threads internally, but only one thread executes your JavaScript; a lot of machinery, such as event queues and the libuv library, goes into making it asynchronous.
Should be doable. As a simple example:
// parent.js
var child_process = require('child_process');
var numchild = require('os').cpus().length;
var done = 0;

for (var i = 0; i < numchild; i++) {
  var child = child_process.fork('./child');
  child.send((i + 1) * 1000);
  child.on('message', function(message) {
    console.log('[parent] received message from child:', message);
    done++;
    if (done === numchild) {
      console.log('[parent] received all results');
      ...
    }
  });
}

// child.js
process.on('message', function(message) {
  console.log('[child] received message from server:', message);
  setTimeout(function() {
    process.send({ child : process.pid, result : message + 1 });
    process.disconnect();
  }, (0.5 + Math.random()) * 5000);
});
So the parent process spawns a number of child processes and passes each of them a message. It also installs an event handler to listen for any messages sent back from the child (containing the result, for instance).
The child process waits for messages from the parent, and starts processing (in this case, it just starts a timer with a random timeout to simulate some work being done). Once it's done, it sends the result back to the parent process and calls process.disconnect() to disconnect itself from the parent (effectively stopping the child process).
The parent process keeps track of the number of child processes started, and the number of them that have sent back a result. When those numbers are equal, the parent received all results from the child processes so it can combine all results and return the JSON result.