
How to run an async function for each line of a very large (> 1GB) file in Node.js

Say you have a huge (> 1GB) CSV of record ids:

655453
4930285
493029
4930301
493031
...

And for each id you want to make a REST API call to fetch the record data, transform it locally, and insert it into a local database.

How do you do that with Node.js' Readable Stream?

My question is basically this: How do you read a very large file, line-by-line, run an async function for each line, and [optionally] be able to start reading the file from a specific line?

I'm starting to learn how to use fs.createReadStream from the following Quora question:

http://www.quora.com/What-is-the-best-way-to-read-a-file-line-by-line-in-node-js

var fs = require('fs');
var lazy = require('lazy');

var stream = fs.createReadStream(path, {
  flags: 'r',
  encoding: 'utf-8'
});

new lazy(stream).lines.forEach(function(line) {
  var id = line.toString();
  // pause stream
  stream.pause();
  // make async API call...
  makeAPICall(id, function() {
    // then resume to process next id
    stream.resume();
  });
});

But that pseudocode doesn't work: the lazy module forces you to read the whole file (as a stream, with no way to pause), so that approach doesn't seem viable.

Another thing: I would like to be able to start processing this file from a specific line. The reason is that processing each id (making the API call, cleaning the data, etc.) can take up to half a second per record, so I don't want to have to start from the beginning of the file every time. The naive approach I'm considering is to capture the line number of the last id processed and save it. Then, when I parse the file again, I stream through the ids line by line until I reach the line I left off at, and only then do the makeAPICall business. Another naive approach is to split the input into small files (say, 100 ids each) and process them one at a time (a dataset small enough to handle entirely in memory, without an IO stream). Is there a better way to do this?
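To make the line-skipping idea concrete, here's a rough sketch using Node's built-in readline module (makeAPICall is the same placeholder as above; lastProcessedLine and the file name are made up for illustration, and this is only a sketch, not a finished solution):

var fs = require('fs');
var readline = require('readline');

var lastProcessedLine = 40000; // loaded from wherever it was persisted
var currentLine = 0;

var rl = readline.createInterface({
  input: fs.createReadStream('ids.csv', { encoding: 'utf-8' })
});

rl.on('line', function(line) {
  currentLine++;
  if (currentLine <= lastProcessedLine) {
    return; // fast-forward past ids that were already processed
  }
  var thisLine = currentLine;
  // Pause input while the async call runs (readline may still emit a few
  // already-buffered lines after pause(), so this is only approximate).
  rl.pause();
  makeAPICall(line.trim(), function() {
    lastProcessedLine = thisLine; // persist this somewhere durable
    rl.resume();
  });
});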

I can see how this gets tricky (and where node-lazy comes in): the chunk in stream.on('data', function(chunk) {}); may contain only part of a line (if the bufferSize is small, each chunk might hold 10 lines, but because the ids are of variable length it may really be 9.5 lines or whatever). This is why I'm wondering what the best approach is to the question above.
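To illustrate the partial-line problem, here's a minimal hand-rolled sketch that buffers whatever follows the last newline in each chunk and prepends it to the next chunk (handleLine is a made-up placeholder for the per-id work, and this ignores pause/resume entirely):

var fs = require('fs');

var stream = fs.createReadStream('ids.csv', { encoding: 'utf-8' });
var remainder = '';

stream.on('data', function(chunk) {
  var pieces = (remainder + chunk).split('\n');
  remainder = pieces.pop(); // the last piece may be an incomplete line
  pieces.forEach(function(line) {
    if (line.length) {
      handleLine(line); // placeholder for the per-line work
    }
  });
});

stream.on('end', function() {
  if (remainder.length) {
    handleLine(remainder); // the file may not end with a newline
  }
});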

asked Jun 18 '12 by Lance



1 Answer

Related to Andrew Андрей Листочкин's answer:

You can use a module like byline to get a separate data event for each line. It's a transform stream wrapped around the original file stream, and each chunk it emits is exactly one line, which lets you pause after each line.

byline won't read the entire file into memory like lazy apparently does.

var fs = require('fs');
var byline = require('byline');

var stream = fs.createReadStream('bigFile.txt');
stream.setEncoding('utf8');

// Comment out this line to see what the transform stream changes.
stream = byline.createStream(stream); 

// Write each line to the console with a delay.
stream.on('data', function(line) {
  // Pause until we're done processing this line.
  stream.pause();

  setTimeout(function() {
    console.log(line);

    // Resume to process the next line.
    stream.resume();
  }, 200);
});

answered Oct 12 '22 by Chris Sidi