
Nodejs: How can I optimize writing many files?

I'm working in a Node environment on Windows. My code is receiving 30 Buffer objects (~500-900kb each) each second, and I need to save this data to the file system as quickly as possible, without engaging in any work that blocks the receipt of the following Buffer (i.e. the goal is to save the data from every buffer, for ~30-45 minutes). For what it's worth, the data is sequential depth frames from a Kinect sensor.

My question is: What is the most performant way to write files in Node?

Here's pseudocode:

const fs = require('fs')

let num = 0

async function writeFile(filename, data) {
  fs.writeFileSync(filename, data)
}

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function (data) {

  let filename = 'file-' + num++

  // Do anything with data here to optimize the write?
  writeFile(filename, data)
})

fs.writeFileSync seems much faster than fs.writeFile, which is why I'm using that above. But are there any other ways to operate on the data or write to file that could speed up each save?

Asked Jan 29 '23 by ACPrice


2 Answers

First off, you never want to use fs.writeFileSync() when handling real-time data because it blocks the entire node.js event loop until the file write is done.

OK, since you're writing each block of data to a different file, you want to allow multiple disk writes to be in flight at the same time, but not an unlimited number. So it's still appropriate to use a queue, but this time the queue doesn't have just one write in progress at a time; it has some number of writes in progress at the same time:

const EventEmitter = require('events');
const fs = require('fs');

class Queue extends EventEmitter {
    constructor(basePath, baseIndex, concurrent = 5) {
        super();
        this.basePath = basePath;
        this.q = [];
        this.paused = false;
        this.inFlightCntr = 0;
        this.fileCntr = baseIndex;
        this.maxConcurrent = concurrent;
    }

    // add item to the queue and start writes (if not already at the concurrency limit)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // start writes from the queue, up to the concurrency limit
    write() {
        while (!this.paused && this.q.length && this.inFlightCntr < this.maxConcurrent) {
            this.inFlightCntr++;
            let buf = this.q.shift();
            try {
                fs.writeFile(this.basePath + this.fileCntr++, buf, err => {
                    this.inFlightCntr--;
                    if (err) {
                        this.err(err);
                    } else {
                        // write more data
                        this.write();
                    }
                });
            } catch(e) {
                this.err(e);
            }
        }
    }

    err(e) {
        this.pause();
        this.emit('error', e);
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}

let q = new Queue("file-", 0, 5);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(e) {
    // got some sort of write error here
    console.log(e);
});

Things to consider:

  1. Experiment with the concurrent value you pass to the Queue constructor. Start with a value of 5, then see whether raising that value gives you better or worse performance. The node.js file I/O subsystem uses a thread pool to implement asynchronous disk writes, so there is a maximum number of concurrent writes it will allow; cranking the concurrent number up really high probably won't make things go faster.

  2. You can experiment with increasing the size of the disk I/O thread pool by setting the UV_THREADPOOL_SIZE environment variable before you start your node.js app (see the sketch after this list).

  3. Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD on a fast bus would be best.

  4. If you can spread the writes out across multiple actual physical disks, that will likely also increase write throughput (more disk heads at work).
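
As a minimal sketch of point 2 above: the libuv thread pool is created lazily, so the variable has to be set before any file I/O runs. The safest option is to set it in the shell before launching node (e.g. UV_THREADPOOL_SIZE=16 node app.js); setting it at the very top of the entry script is also commonly used. The value 16 is just a number to experiment with, not a recommendation:

// very top of the entry script -- must run before any fs/dns/crypto call
// forces libuv to create its thread pool
process.env.UV_THREADPOOL_SIZE = 16;   // example value, tune by measuring

const fs = require('fs');
// ... rest of the app (Queue setup, dataSender handler, etc.)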


This is a prior answer based on the initial interpretation of the question (before editing that changed it).

Since it appears you need to do your disk writes in order (all to the same file), I'd suggest that you either use a write stream and let the stream object serialize and buffer the data for you, or create a queue yourself like this:

const EventEmitter = require('events');
const fs = require('fs');

class Queue extends EventEmitter {
    // takes an already opened file descriptor (fd)
    constructor(fileHandle) {
        super();
        this.f = fileHandle;
        this.q = [];
        this.nowWriting = false;
        this.paused = false;
    }

    // add item to the queue and write (if not already writing)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // write next block from the queue (if not already writing)
    write() {
        if (!this.nowWriting && !this.paused && this.q.length) {
            this.nowWriting = true;
            let buf = this.q.shift();
            fs.write(this.f, buf, (err, bytesWritten) => {
                this.nowWriting = false;
                if (err) {
                    this.pause();
                    this.emit('error', err);
                } else {
                    // write next block
                    this.write();
                }
            });
        }
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}

// pass an already opened file descriptor (e.g. from fs.openSync)
let q = new Queue(fileHandle);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(err) {
    // got disk write error here
});

You could use a write stream instead of this custom Queue class, but the problem is that the write stream's internal buffer may fill up, and then you'd need a separate buffer as a place to put the data anyway. Using your own custom queue like the one above takes care of both issues at once.
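
For comparison, here is a minimal sketch of the write-stream approach referred to above. dataSender and its 'gotData' event come from the question; the 'end' event is a hypothetical end-of-capture signal. Note that you still have to watch the return value of write() and the 'drain' event, which is exactly the buffering concern described above:

const fs = require('fs');

const out = fs.createWriteStream('./frames.bin');

dataSender.on('gotData', function (data) {
    // write() returns false once the stream's internal buffer is full;
    // the data is still queued internally, but memory keeps growing
    // until 'drain' fires, so you may want your own pause/overflow logic here
    const ok = out.write(data);
    if (!ok) {
        out.once('drain', () => {
            // internal buffer flushed; safe to keep writing at full rate
        });
    }
});

dataSender.on('end', () => out.end());   // hypothetical end-of-capture event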

Other Scalability/Performance Comments

  1. Because you appear to be writing the data serially to the same file, your disk writing won't benefit from clustering or running multiple operations in parallel because they basically have to be serialized.

  2. If your node.js server has other things to do besides just doing these writes, there might be a slight advantage (would have to be verified with testing) to creating a second node.js process and doing all the disk writing in that other process. Your main node.js process would receive the data and then pass it to the child process that would maintain the queue and do the writing.

  3. Another thing you could experiment with is coalescing writes. When you have more than one item in the queue, you could combine them into a single write. If the writes are already sizable, this probably doesn't make much difference, but if the writes were small, it could make a big difference (combining lots of small disk writes into one larger write is usually more efficient). See the sketch after this list.

  4. Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD would be best.
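
As a possible sketch of the write-coalescing idea in point 3, here is the single-file Queue's write() method changed to combine everything currently queued into one buffer before issuing the write (the rest of the class stays the same):

    // inside the single-file Queue class above
    write() {
        if (!this.nowWriting && !this.paused && this.q.length) {
            this.nowWriting = true;
            // coalesce all pending blocks into one larger write
            let buf = this.q.length === 1 ? this.q.shift() : Buffer.concat(this.q.splice(0));
            fs.write(this.f, buf, (err, bytesWritten) => {
                this.nowWriting = false;
                if (err) {
                    this.pause();
                    this.emit('error', err);
                } else {
                    // write whatever accumulated while this write was in flight
                    this.write();
                }
            });
        }
    }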

Answered Jan 31 '23 by jfriend00


I have written a service that does this extensively, and the best thing you can do is pipe the input data directly to the file (if you have an input stream as well). A simple example that downloads a file this way:

const http = require('http')
const fs = require('fs')

const ostream = fs.createWriteStream('./output')
http.get('http://nodejs.org/dist/index.json', (res) => {
    res.pipe(ostream)
})
.on('error', (e) => {
    console.error(`Got error: ${e.message}`);
})

So in this example there is no intermediate copy of the whole file. As the file is read in chunks from the remote http server, it is written to the file on disk. This is much more efficient than downloading the whole file from the server, holding it in memory, and then writing it to a file on disk.

Streams are the basis of many operations in Node.js, so you should study them as well.
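
If your data arrives as discrete events rather than as a readable stream, one way to apply the same piping idea is to wrap the event source in a Readable yourself. This is only a sketch, under the assumption that all frames go into a single output file; the 'end' event below is a hypothetical end-of-capture signal:

const { Readable } = require('stream');
const fs = require('fs');

// push-style readable fed from the 'gotData' events in the question
const input = new Readable({ read() {} });

dataSender.on('gotData', (buf) => input.push(buf));
dataSender.on('end', () => input.push(null));   // hypothetical end-of-capture event

// pipe() manages flow between the readable's buffer and the file;
// frames pushed while the disk lags behind accumulate in the readable's buffer
input.pipe(fs.createWriteStream('./depth-frames.bin'));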

One other thing you should investigate, depending on your scenario, is UV_THREADPOOL_SIZE, as I/O operations use the libuv thread pool, which defaults to 4 threads, and you might fill that up if you do a lot of writing.

Answered Jan 31 '23 by Petar