Using multiple cores to process a large, sequential file in C++

Question

I have a large file (bigger than RAM, so I can't read it all at once) and I need to process it row by row in C++. I want to use multiple cores, preferably with Intel TBB or Microsoft PPL. I would rather avoid preprocessing the file (for example, splitting it into 4 parts).

I was thinking about using 4 iterators, initialized to positions 0, n/4, 2*n/4, and 3*n/4 in the file.

Is this a good solution, and is there a simple way to achieve it?
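To make the idea concrete, here is a rough sketch of how the starting positions could be aligned to record boundaries, assuming records end with a '\0' byte. The function name `chunk_starts` is made up for illustration, and the code uses only the standard library; it is one possible way to do this, not a tested recipe for files on disk.

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <limits>
#include <sstream>  // std::istringstream, used in the usage note below
#include <string>
#include <vector>

// Hypothetical helper: returns `parts` byte offsets. Each offset is either 0
// or the position right after a '\0' separator, so every chunk starts on a
// record boundary. Chunk i covers [starts[i], starts[i + 1]) and may be empty
// when a record is longer than size / parts. The stream position is left
// unspecified; callers should seek before reading.
std::vector<std::streamoff> chunk_starts(std::istream& in, std::size_t parts) {
    in.clear();
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();

    std::vector<std::streamoff> starts;
    starts.push_back(0);
    for (std::size_t i = 1; i < parts; ++i) {
        const std::streamoff guess = size * static_cast<std::streamoff>(i) /
                                     static_cast<std::streamoff>(parts);
        std::streamoff start = 0;
        if (guess > 0) {
            in.clear();
            // Begin one byte early: if that byte is the separator, the guess
            // is already a record start and must not be skipped.
            in.seekg(guess - 1);
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\0');
            start = in.eof() ? size : static_cast<std::streamoff>(in.tellg());
        }
        starts.push_back(start);
    }
    return starts;
}
```

For the in-memory input `std::istringstream in(std::string("aa\0bbb\0c\0dddd\0", 14));`, the call `chunk_starts(in, 4)` yields the offsets 0, 3, 7 and 14, so the last chunk is empty because the final record straddles the 3/4 mark.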

Or do you know of any libraries that support efficient, concurrent reading of streams?

Update:

I ran some tests. I/O is not the bottleneck; the CPU is. And I have plenty of RAM for buffers.

I need to parse each record (variable size, approx. 2000 bytes each; records are separated by a unique '\0' character), validate it, do some calculations, and write the result to one or more other files.

SoapBox · Accepted Answer

Since you are able to split it into N parts, it sounds like the processing of each row is largely independent. In that case, I think the simplest solution is to set up one thread that reads the file row by row and places each row into a tbb::concurrent_queue. Then spawn as many threads as you need to pull rows off that queue and process them.

This solution is independent of the file size, and if you find you need more (or fewer) worker threads, it's trivial to change the number. But this won't work if there are dependencies between the rows, unless you set up a second pool of "post-processing" threads to handle that, but then things may start to get too complex.
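A minimal sketch of that reader-plus-workers layout follows. So that it builds without TBB, it swaps in a small mutex-and-condition-variable queue from the standard library; with TBB installed, something like tbb::concurrent_bounded_queue would take its place. The names `RecordQueue` and `process_records`, the capacity of 1024, and the '\0' record separator (taken from the question's update) are assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <istream>
#include <mutex>
#include <sstream>  // std::istringstream, used in the usage note below
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal bounded blocking queue; a stand-in for a TBB concurrent queue.
class RecordQueue {
public:
    explicit RecordQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full, so the reader cannot outrun the workers.
    void push(std::string rec) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push_back(std::move(rec));
        not_empty_.notify_one();
    }

    // Blocks until a record is available. Returns false once the queue has
    // been closed and fully drained.
    bool pop(std::string& rec) {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        rec = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return true;
    }

    // Signals that no more records will be pushed.
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
    std::deque<std::string> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// Reads '\0'-separated records on the calling thread and hands each one to
// `process` on one of `workers` threads. `process` must be thread-safe, and
// records may be processed in any order.
void process_records(std::istream& in, unsigned workers,
                     const std::function<void(const std::string&)>& process) {
    assert(workers > 0);
    RecordQueue queue(1024);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&] {
            std::string rec;
            while (queue.pop(rec)) process(rec);
        });
    }
    std::string rec;
    while (std::getline(in, rec, '\0')) queue.push(std::move(rec));
    queue.close();
    for (auto& t : pool) t.join();
}
```

A call such as `process_records(file, std::thread::hardware_concurrency(), handler)` would then read on the calling thread and fan records out to the workers; for a quick check, a `std::istringstream` built from `std::string("a\0bb\0ccc\0", 9)` delivers three records to the handler. The bounded capacity is a design choice: it keeps memory use flat when the reader is faster than the workers.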

Alexey Kukanov · Answer

My recommendation is to use TBB's pipeline pattern. The first, serial stage of the pipeline reads a desired portion of data from the file; subsequent stages process data chunks in parallel, and the last stage writes into another file, possibly in the same order as the data were read.

An example of this approach is available in TBB distributions; see examples/pipeline/square. It uses the "old" interface: the class tbb::pipeline and filters (classes derived from tbb::filter) that pass data through void* pointers. The newer, type-safe, lambda-friendly "declarative" interface, tbb::parallel_pipeline(), may be more convenient to use.

Tags:

c++

multithreading

multicore

tbb

Piotr
