 

Process same file in two threads using ifstream

I have an input file in my application that contains a vast amount of information. Reading it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads with separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's uncopyable). So, how do I handle this?

Immediately I can think of two ways,

  1. Construct a new ifstream for the second thread, and open it on the same file.
  2. Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>). Each thread seeks to the file offset it is currently interested in whenever it gets a time slice.

Is one of these two methods preferred?

Is there a third (or fourth) option that I have not yet thought of?

Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible) is some OS-level disk caching at both file offsets simultaneously.

Thanks.

asked Jun 02 '11 by J T

4 Answers

Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.

If you have a single std::ifstream, you'll have to worry about synchronizing access to it, and the constant seeking may defeat the operating system's automatic sequential read-ahead caching, resulting in poorer performance.
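
A minimal sketch of that approach, using C++11 std::thread (the file name, offsets, and sizes below are placeholders, not from the question):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <thread>
#include <vector>

// Each thread opens its own std::ifstream on the same file and reads
// `count` bytes starting at its own offset.
void read_range(const char* path, std::uint64_t offset, std::size_t count,
                std::vector<char>& out)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offset));
    out.resize(count);
    in.read(out.data(), static_cast<std::streamsize>(count));
    out.resize(static_cast<std::size_t>(in.gcount()));  // trim a short read
}

int main()
{
    std::vector<char> first, second;
    std::thread t1(read_range, "input.dat", std::uint64_t(0),
                   std::size_t(1) << 20, std::ref(first));
    std::thread t2(read_range, "input.dat", std::uint64_t(512) << 20,
                   std::size_t(1) << 20, std::ref(second));
    t1.join();
    t2.join();
}
```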

answered Nov 11 '22 by Cory Nelson


Between the two, I would prefer the second. Having two handles open on the same file might give an inconsistent view of its contents, depending on the underlying OS.

For a third option, pass a reference or raw pointer to the stream into the other thread. As long as the semantics are that one thread "owns" the istream, a raw pointer or reference is fine.
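
A minimal sketch of what safely sharing one stream might look like; the SharedReader name and read_at interface are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>

// One ifstream shared by all readers. Every access must seek first,
// because the threads share a single file position.
struct SharedReader {
    std::ifstream in;
    std::mutex    m;

    explicit SharedReader(const char* path) : in(path, std::ios::binary) {}

    std::size_t read_at(std::uint64_t offset, char* buf, std::size_t count)
    {
        std::lock_guard<std::mutex> lock(m);
        in.clear();  // reset eofbit/failbit left over from a prior read
        in.seekg(static_cast<std::streamoff>(offset));
        in.read(buf, static_cast<std::streamsize>(count));
        return static_cast<std::size_t>(in.gcount());
    }
};
```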

Finally, note that on the vast majority of hardware, the disk, not the CPU, is the bottleneck when loading large files. Using two threads will make this worse, because you're turning sequential file access into random access. Typical hard disks can do maybe 100 MB/s sequentially, but top out at 3 or 4 MB/s under random access.

answered Nov 11 '22 by Billy ONeal


Other option:

  • Memory-map the file, and create as many in-memory istream objects over it as you want. (istrstream works for this; istringstream does not, because it copies the data instead of wrapping the existing buffer.)
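
A rough sketch of this on a POSIX system, with error handling omitted and the file name a placeholder:

```cpp
#include <strstream>   // std::istrstream: wraps an existing char buffer
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.dat", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; the OS pages it in on demand.
    const char* base = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // Each thread gets its own independent stream over its half of the map.
    std::istrstream first(base, st.st_size / 2);
    std::istrstream second(base + st.st_size / 2, st.st_size - st.st_size / 2);

    // ... hand `first` and `second` to two worker threads ...

    munmap(const_cast<char*>(base), st.st_size);
    close(fd);
}
```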
answered Nov 11 '22 by Ben Voigt


It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.

It might be worth experimenting to see how read-ahead works on your system: open the file, read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it and read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed-length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting each test; under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)
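
A rough timing harness for these experiments might look like this (the function name and the 1 MiB record size are illustrative; the drop_caches trick noted in the comment is a Linux-specific alternative to the /dev/null copy mentioned above):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <vector>

// Times a sequential read of `bytes` bytes starting at `start`.
// Purge the OS cache between runs (e.g., on Linux, as root:
// `echo 3 > /proc/sys/vm/drop_caches`), or the numbers are meaningless.
double time_read(const char* path, std::streamoff start, std::size_t bytes)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(start);
    std::vector<char> buf(std::size_t(1) << 20);  // 1 MiB "records"

    auto t0 = std::chrono::steady_clock::now();
    std::size_t remaining = bytes;
    while (remaining > 0) {
        in.read(buf.data(),
                static_cast<std::streamsize>(std::min(buf.size(), remaining)));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;  // EOF or error
        remaining -= got;
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```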

That will give you some idea, but to be really certain, the best solution is to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor) and constantly seeking won, but you never know.

I'd also look into system-specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of the file at a time, but it becomes a lot more complicated.)
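
A sketch of that windowed approach on POSIX, assuming a window size that is a multiple of the page size (mmap requires page-aligned offsets; error handling is mostly elided):

```cpp
#include <algorithm>
#include <cstddef>
#include <sys/mman.h>
#include <sys/types.h>

// Maps a large file one window at a time. The 256 MiB window size is
// illustrative; any multiple of the page size works.
void process_in_windows(int fd, off_t file_size)
{
    const off_t window = off_t(256) << 20;  // 256 MiB per mapping
    for (off_t off = 0; off < file_size; off += window) {
        std::size_t len =
            static_cast<std::size_t>(std::min(window, file_size - off));
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) return;
        // ... process the bytes in [p, p + len) ...
        munmap(p, len);
    }
}
```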

Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)

answered Nov 11 '22 by James Kanze