I have an input file in my application that contains a vast amount of information. Reading over it sequentially, and at only a single file offset at a time is not sufficient for my application's usage. Ideally, I'd like to have two threads, that have separate and distinct ifstream
s reading from two unique file offsets of the same file. I can't just start one ifstream
up, and then make a copy of it using its copy constructor (since its uncopyable). So, how do I handle this?
Immediately I can think of two ways,
ifstream
for the second thread, open it on the same file.ifstream
across both threads (using for instance boost::shared_ptr<>
). Seek to the appropriate file offset that current thread is currently interested in, when the thread gets a time slice.Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.
Thanks.
Two std::ifstream
instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream
instances concurrently should give quite nice performance.
If you have a single std::ifstream
you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view between the files, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, the raw pointer or reference are fine.
Finally note that on the vast majority of hardware, the disk is the bottleneck, not CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100MB/s sequentially, but top out at 3 or 4 MB/s random access.
Other option:
istrstream
is good for this, istringstream
is not).It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so should definitly be avoided.
It might be worth experimenting how read-ahead works on your system:
open the file, then read the first half of it sequentially, and see how
long that takes. Then open it, seek to the middle, and read the second
half sequentially. (On some systems I've seen in the past, a simple
seek, at any time, will turn off read-ahead.) Finally, open it, then
read every other record; this will simulate two threads using the same
file descriptor. (For all of these tests, use fixed length records, and
open in binary mode. Also take whatever steps are necessary to ensure
that any data from the file is purged from the OS's cache before
starting the test—under Unix, copying a file of 10 or 20 Gigabytes
to /dev/null
is usually sufficient for this.
That will give you some ideas, but to be really certain, the best
solution would be to test the real cases. I'd be surprised if sharing a
single ifstream
(and thus a single file descriptor), and constantly
seeking, won, but you never know.
I'd also recommend system specific solutions like mmap
, but if you've
got that much data, there's a good chance you won't be able to map it
all in one go anyway. (You can still use mmap
, mapping sections of it
at a time, but it becomes a lot more complicated.)
Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With