I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I need to do is perform some work on each of these 8192-byte chunks. The file only needs to be read once, from top to bottom.
Since this appears to be an embarrassingly parallel problem, I would like to have X threads each read an equal range of (file size / X) bytes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads that each open the same file and seek to their respective sections to process. However, this solution seems to suffer from HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method were used on an SSD instead?
Or would it be more straightforward to just memory-map the whole file and use #pragma omp parallel for
to process the chunks? I suppose I would need sufficient RAM to do this?
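For reference, here is a rough, untested sketch of the mmap + OpenMP approach I have in mind (the checksum in process_chunk() is just placeholder work):

```c
/* Rough sketch of the mmap + OpenMP idea (untested).
 * process_chunk() is a stand-in for the real per-chunk work. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK 8192

static unsigned long long g_sum;        /* dummy result, just to have some work */

static void process_chunk(const unsigned char *chunk)
{
    unsigned long long s = 0;
    for (int i = 0; i < CHUNK; i++)
        s += chunk[i];
    #pragma omp atomic
    g_sum += s;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* Map the whole file read-only; pages are faulted in on demand, so the
     * full 4-6 GB does not have to fit in RAM at once (64-bit build needed). */
    unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    long nchunks = st.st_size / CHUNK;  /* file size is N * 8192 */

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < nchunks; i++)
        process_chunk(base + (off_t)i * CHUNK);

    munmap(base, st.st_size);
    close(fd);
    printf("checksum: %llu\n", g_sum);
    return 0;
}
```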
What would you suggest?
Don't use mmap()
Per Linus Torvalds himself:
People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
- quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
- page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
- if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
- if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal for mmap().
Note the last point: just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
- Open the file once, using open() to get a single int file descriptor.
- Use pread() per thread to read the appropriate 8kB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
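A minimal, untested sketch of that layout with pthreads; the thread count, the empty process_chunk(), and the assumption that the file size is an exact multiple of 8192 are placeholders to adjust:

```c
/* Sketch: one shared fd, each thread pread()s its own byte range. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK    8192
#define NTHREADS 8                 /* tune: usually more than the core count */

struct job { int fd; off_t start; off_t end; };   /* byte range [start, end) */

static void process_chunk(const unsigned char *buf)
{
    (void)buf;                     /* real per-chunk work goes here */
}

static void *worker(void *arg)
{
    struct job *j = arg;
    unsigned char buf[CHUNK];

    for (off_t off = j->start; off < j->end; off += CHUNK) {
        /* pread() takes an explicit offset: no lseek(), no shared file position. */
        ssize_t n = pread(j->fd, buf, CHUNK, off);
        if (n != CHUNK)
            break;                 /* short read or error */
        process_chunk(buf);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    off_t nchunks = st.st_size / CHUNK;    /* file size is N * 8192 */
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {   /* split the chunks evenly */
        jobs[i].fd    = fd;
        jobs[i].start = nchunks *  i      / NTHREADS * CHUNK;
        jobs[i].end   = nchunks * (i + 1) / NTHREADS * CHUNK;
        pthread_create(&tid[i], NULL, worker, &jobs[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}
```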
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
- Open the file once, using open() with direct IO (assuming Linux: open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here).
- When a buffer is read, pass it to the worker threads, then read to fill the next buffer.
- When all workers are done using their part of the buffer, pass the buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
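To make the direct-IO part concrete, here is a rough, untested sketch of the reading side only; the hand-off to workers is whatever SPMC queue is already in place, so each buffer is simply dropped here. Note that O_DIRECT requires the buffer address, read size, and file offset to all be block-aligned, and the 1 MiB read size is just an arbitrary starting point to tune:

```c
/* Sketch of the O_DIRECT reader thread (Linux). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_SIZE (1024 * 1024)   /* tune: larger reads mean fewer seeks */
#define ALIGNMENT 4096            /* O_DIRECT wants aligned buffer, size, offset */

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd == -1) { perror("open"); return 1; }

    for (;;) {
        /* In the real program this buffer comes from a pool that the worker
         * threads return buffers to when they are done with them. */
        void *buf;
        if (posix_memalign(&buf, ALIGNMENT, READ_SIZE) != 0) { close(fd); return 1; }

        ssize_t n = read(fd, buf, READ_SIZE);
        if (n <= 0) { free(buf); break; }          /* EOF (0) or error (-1) */

        /* Hand buf (n bytes, i.e. n / 8192 chunks) to the worker threads here;
         * for the sketch we just drop it. */
        free(buf);
    }

    close(fd);
    return 0;
}
```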
Even if the processing of each 8K block is significant (short of OCR processing), the I/O is the bottleneck, unless it can be arranged for parts of the file to already be cached by previous operations.
If the system this is to run on can be dedicated to the problem: check the file size first (fstat()), then revise the read path to use asynchronous reading. See man aio_read
and man 7 aio
to learn what needs to be done.
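As a rough, untested sketch of the pattern those man pages describe, here is a double-buffered sequential reader (the empty process_buffer() and the 256 KiB read size are placeholders): the next read is queued before the just-filled buffer is processed, so the disk stays busy while the CPU works.

```c
/* Double-buffered sequential reading with POSIX AIO (link with -lrt on older glibc). */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define READ_SIZE (256 * 1024)

static unsigned char bufs[2][READ_SIZE];

static void process_buffer(const unsigned char *buf, ssize_t len)
{
    (void)buf; (void)len;            /* per-8192-byte-chunk work goes here */
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_nbytes = READ_SIZE;
    cb.aio_offset = 0;
    cb.aio_buf    = bufs[0];

    if (aio_read(&cb) == -1) { perror("aio_read"); return 1; }

    int cur = 0;                     /* index of the buffer being filled */
    for (;;) {
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);   /* block until the read completes */

        ssize_t n = aio_return(&cb);
        if (n <= 0)                  /* EOF or error */
            break;

        int done = cur;              /* this buffer is now full */
        cur = 1 - cur;

        /* Queue the next read immediately, then process the finished buffer. */
        cb.aio_offset += n;
        cb.aio_buf     = bufs[cur];
        if (aio_read(&cb) == -1) { perror("aio_read"); return 1; }

        process_buffer(bufs[done], n);
    }

    close(fd);
    return 0;
}
```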