Logo Questions Linux Laravel Mysql Ubuntu Git Menu

How to iterate large arrays in NodeJS

If I have to do iterations of large sets of data in Node, what are some precautions that I can take in an attempt to avoid making the server less responsive to other requests? The amount of time it takes for the iteration to finish is not of crucial importance to me, should I look into something like this or are there any other similar tricks that I should know about?

like image 978
wdGelwix Avatar asked Jan 26 '23 06:01


1 Answers

Here are some considerations for manipulating large data sets in nodejs which come out of my experiences dealing with data sets in the billions and single arrays of 100,000,000 items.

1. Minimize Garbage Collection Work. To the best of your ability, avoid creating temporary objects in the main loop that is processing the large data set. This includes locally scoped variables (where a new variable is created through each invocation of the loop) and includes any functions/methods that return objects. If your code creates 10 objects each time through the loop and the array has 1.2 million items in it, that's 10.2 million objects the GC has to deal with. Besides all the CPU that it takes the GC to process those, it's also a lot of peak memory usage since the GC lets things accumulate until memory gets scarce or until it finds some idle time.

2. Measure the time it takes to process your worst case array and improve it as much as you can. Work on the performance of the loop processing with specific performance tests so you now know exactly what the max array processing time is.

3. Decide what latency delay is acceptable in your server. This really depends upon the application and how often this delay will be encountered so you will have to figure out what will work for you. An occasional 100ms delay is probably no big deal for lots of applications, but if that occurs frequently it becomes a problem or if you have some sort of responsiveness-critical aspect to your server (such as gaming), then 100ms would be way too long.

4. Move processing to Worker Threads. If your best performance is worse than your acceptable latency, then you will probably want to move the processing to nodejs Worker Threads. It probably makes sense to create a pool of threads (one per actual CPU core in your server) and then create a work queue that is serviced in FIFO order. When a large array job needs to be done, you put it in the queue and return a promise. If a worker thread is available, the job is immediately sent off to the Worker Thread. If all worker threads are busy, it sits in the queue until a thread is finished and is free. At that point, the oldest item in the queue (FIFO order) is sent of to the Worker Thread. When a worker thread finishes the job, the result is communicated back and a promise is resolved and the code waiting for the result gets the resolved promise notification.

5. Use SharedArrayBuffer if possible. You don't want to be copying large amounts of data back and forth between Worker Threads as that will eat up CPU and cause lots of work for the CPU. A key technique to processing large amounts of data in Worker Threads is to put that data in a SharedArrayBuffer that can be directly passed to the Worker Thread as a reference without any copying. This is hugely more efficient for CPU, GC and peak memory use.

6. Understand concurrency consequences of using a SharedArrayBuffer. A SharedArrayBuffer being operated on by Worker Threads is one place in node.js where you can be exposed to multi-thread race conditions. So, you need a design model for how you're going to do it. The simplest model is to set things up so that only one thread EVER has access to the same SharedArrayBuffer. You create it in the main thread and then when you pass it off to the Worker Thread for processing, you pass the SharedArrayBuffer reference to the WorkerThread and you completely forget about it in the main thread (store it nowhere else). This means that the main thread essentially passes temporary ownership of it to the Worker Thread. When the Worker Thread finishes, it passes ownership back (returning the SharedArrayBuffer reference in the result message it sends). This model is simple because you can't accidentally access it from two threads if you make sure that no more than one thread EVER has a reference to it at the same time.

7. Use Atomics to protect shared data. If you can't use a simple access model for the SharedArrayBuffer as discussed above, then you may need to use Atomics to protect the integrity of data.

Some other design options to consider:

1. Break up the data and process it in chunks. You can write the processing in chunks such that you program a short delay between chunks so the main thread has an opportunity to processing messages between chunks. This is how we were forced to do things before we had access to threads. See Best way to iterate over an array without blocking the UI for an example. How practical this is or how much of a rewrite this would cause really depends upon the problem and the data. On a server, I would probably tend to use threads these days rather than try to break the processing into small tiny chunks.

2. Consider if a database can help you. Databases are for managing large sets of data and they typically do it in a separate process (which helps with the server responsiveness issue).

3. Worker List class. Here's a WorkerList class that I used in order to queue up data to use a worker pool. This is part of a larger crypto test app that used multiple threads to offload large amounts of crypto work. The whole repository is here on Github.

4. Work on the data incrementally as it arrives. You mentioned "prepare them for database insertion". Depending upon the specific problem, you may not have to even accumulate large amounts of data at all. Maybe you can process the data more incrementally as it arrives and, by doing it as you go, you never end up with the giant job that interferes with your main server work. to a point where you have 1.2 million item arrays.

like image 85
jfriend00 Avatar answered Jan 27 '23 21:01