 

Multithreaded File Reading

I need to read/parse a large binary file (4 to 6 GB) that comes in fixed chunks of 8192 bytes. My current solution streams the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.

EDIT

File size = N * 8192 Bytes

All I need to do is apply some processing to each of these 8192-byte chunks. The file only needs to be read once, top to bottom.

Since this appears to be an embarrassingly parallel problem, I would like to have X threads each read an equal range of (File Size / X) bytes independently. The threads do not need to communicate with each other at all.

I've tried spawning X threads that open the same file and seek to their respective sections to process. However, this solution seems to suffer due to HDD mechanical seeks and apparently performs worse than the SPMC solution.

Would there be any difference if this method is used on an SSD instead?

Or would it be more straightforward to just memory-map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient RAM to do this?

What would you suggest?

Andrea Chua asked Dec 21 '16

2 Answers

What would you suggest?

Don't use mmap()

Per Linus Torvalds himself:

People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:

  • quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
  • page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Upsides of mmap:

  • if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.

This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.

  • if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.

    And the automatic sharing is obviously a case of this.

But your test-suite (just copying the data once) is probably pessimal for mmap().

Note the last - just using the data once is a bad use-case for mmap().

For a file on an SSD, since there are no physical head seek movements:

  1. Open the file once, using open() to get a single int file descriptor.

  2. Use pread() per thread to read the appropriate 8 kB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current file offset of the descriptor being read from.

You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.

For a file on mechanical disk(s):

You want to minimize head seek(s) on the mechanical disk.

Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache. Since you're going to stream the file and never re-read any portion of it, the page cache does you no good here.

  1. Using a single producer thread, read large chunks (say 64k to 1MB+) into one of N page-aligned buffers.
  2. When a buffer has been filled, pass it to the worker threads, then read to fill the next buffer.

  3. When all workers are done using their part of the buffer, pass the buffer back to the reading thread.

You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.

Andrew Henle answered Nov 16 '22


Even if the processing of each 8K block is significant (short of something as heavy as OCR), the I/O is the bottleneck. Unless it can be arranged for parts of the file to be already cached by previous operations....

If the system this is to run on can be dedicated to the problem:

  1. Obtain the file size (fstat)
  2. Allocate a buffer that size.
  3. Open and read the whole file into the buffer.
  4. Figure out how to partition the data per thread and spin off the threads.
  5. Time that algorithm.

Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.

wallyk answered Nov 16 '22