NodeJS: How to write a file parser using readStream?

Tags:

node.js

I have a file in a binary format:

The format is as follows:

[4 header bytes] [8 bytes - int64 - how many bytes to read next] [variable number of bytes (as given by the int64) - the actual payload]

And then it repeats, so I must first read the first 12 bytes to determine how many more bytes I need to read.

I have tried:

var readStream = fs.createReadStream('/path/to/file.bin');
readStream.on('data', function(chunk) {  ...  })

The problem I have is that chunk always comes back in chunks of 65536 bytes at a time whereas I need to be more specific on the number of bytes that I am reading.

I have also tried readStream.on('readable', function() { readStream.read(4) }), but it is not very flexible either, because it seems to turn asynchronous code into synchronous code: I have to put the reading in a while loop.

Or maybe readStream is not appropriate in this case and I should use this instead? fs.read(fd, buffer, offset, length, position, callback)

samol asked Mar 02 '26 19:03

1 Answer

Here's what I'd recommend as an abstract handler of a readStream to process abstract data like you're describing:

var pending = Buffer.alloc(9999999); // preallocated; Buffer.alloc instead of the deprecated new Buffer()
var cursor = 0;
stream.on('data', function(d) {
  d.copy(pending, cursor);
  cursor += d.length;

  var test = attemptToParse(pending.slice(0, cursor));
  while (test !== false) {
    // test is a valid blob of data
    processTheThing(test);

    var rawSize = test.raw.length; // How many bytes of data did the blob actually take up?
    pending.copy(pending, 0, rawSize, cursor); // Copy the data after the valid blob to the beginning of the pending buffer
    cursor -= rawSize;
    test = attemptToParse(pending.slice(0, cursor)); // Is there more than one valid blob of data in this chunk? Keep processing if so
  }
});

For your use-case, make sure the pending Buffer is initialized large enough to hold the largest possible valid blob of data you'll be parsing (the 12-byte header plus the largest payload your int64 length field can realistically describe), plus one extra stream chunk (65536 bytes) in case a blob boundary falls right at the edge of a chunk.

My method requires an attemptToParse() function that takes a buffer and tries to parse the data out of it. It should return false if the buffer is too short (not enough data has arrived yet). If it finds a valid object, it should return a parsed object that exposes the raw bytes it consumed (the .raw property in my example). You then do whatever processing you need on the blob (processTheThing()), trim the valid blob out by shifting the remainder to the front of the pending Buffer, and keep going. That way you don't have a constantly growing pending buffer or an array of "finished" blobs. Maybe the receiving end of processTheThing() keeps an array of blobs in memory, maybe it writes them to a database; in this example that's abstracted away, so this code only deals with how to handle the stream data.
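For the format in the question, attemptToParse() could look like the sketch below. The header name, the big-endian byte order, and the sample record are assumptions for illustration; swap readBigUInt64BE for readBigUInt64LE if the file is little-endian:

```javascript
// Hypothetical attemptToParse for a record laid out as:
// [4 header bytes][8-byte big-endian uint64 length][payload]
function attemptToParse(buf) {
  if (buf.length < 12) return false;          // not even header + length yet
  const size = buf.readBigUInt64BE(4);        // payload length as a BigInt
  const total = 12 + Number(size);            // full record size in bytes
  if (buf.length < total) return false;       // payload hasn't fully arrived
  return {
    header: buf.slice(0, 4),
    payload: buf.slice(12, total),
    raw: buf.slice(0, total),                 // lets the caller advance the cursor
  };
}

// Quick check with a hand-built record: header "HDR1", 5-byte payload "hello"
const lenField = Buffer.alloc(8);
lenField.writeBigUInt64BE(5n);
const record = Buffer.concat([Buffer.from('HDR1'), lenField, Buffer.from('hello')]);

console.log(attemptToParse(record.slice(0, 11)));        // false (incomplete)
console.log(attemptToParse(record).payload.toString());  // 'hello'
```

Returning false for both "too short for the header" and "payload incomplete" is what makes the while loop in the data handler terminate cleanly when a record straddles two stream chunks.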

MidnightLightning answered Mar 04 '26 11:03