 

std::ifstream buffer caching

In my application I'm trying to merge sorted files (keeping them sorted, of course), so I have to iterate through each element in both files to write the minimal one to the third file. This is quite slow on big files, and since I don't see any way around the iteration itself, I'm trying to optimize file loading. I have a certain amount of RAM that I can use for buffering: instead of reading 4 bytes from each file every time, I could read something like 100 MB at once and work from that buffer until it is exhausted, then refill it. But I guess ifstream is already doing that, so will buffering myself actually give me more performance, and is there any reason to do it? If fstream does buffer, maybe I can change the size of that buffer?
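For example, something like this (assuming pubsetbuf is actually honored; as far as I understand, whether the stream accepts the requested buffer is implementation-defined, and it must be called before opening the file; the size and file name below are arbitrary):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> bufmem(100 * 1024 * 1024);   // 100 MB, arbitrary size

    std::ifstream input1;
    // must be called before open() to take effect; whether the
    // request is honored is implementation-defined
    input1.rdbuf()->pubsetbuf(bufmem.data(),
                              static_cast<std::streamsize>(bufmem.size()));
    input1.open("f1.bin", std::ios::binary);

    // ... read from input1 as usual ...
}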

Update:

My current code looks like this (pseudocode):

// this is done in a loop
int i1 = input1.read_integer();
int i2 = input2.read_integer();
if (!input1.eof() && !input2.eof())
{
   if (i1 < i2)
   {
      output.write(i1);
      input2.seek_back(sizeof(int));   // i2 must be read again next time
   }
   else
   {
      input1.seek_back(sizeof(int));   // i1 must be read again next time
      output.write(i2);
   }
}
else
{
   if (input1.eof())
      output.write(i2);
   else if (input2.eof())
      output.write(i1);
}

What I don't like here is:

  • seek_back - I have to seek back to the previous position because there is no way to peek 4 bytes ahead (see the sketch after this list)
  • too much reading from the files
  • when one of the streams hits EOF, the loop still keeps checking that stream instead of copying the rest of the other stream straight to the output; this is not a big issue, though, because the chunk sizes are almost always equal
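Roughly, the buffered, peekable reader I have in mind would look like this (just a sketch; the class name and sizes are made up, and it assumes the files contain raw binary ints):

#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical buffered reader: loads a big block of raw ints at once
// and supports peek(), so no seek_back is needed.
class BufferedIntReader
{
    std::ifstream in_;
    std::vector<int> buf_;
    std::size_t pos_ = 0;
    std::size_t count_ = 0;

    void refill()
    {
        in_.read(reinterpret_cast<char*>(buf_.data()),
                 static_cast<std::streamsize>(buf_.size() * sizeof(int)));
        count_ = static_cast<std::size_t>(in_.gcount()) / sizeof(int);
        pos_ = 0;
    }

public:
    BufferedIntReader(const char* name, std::size_t capacity)
        : in_(name, std::ios::binary), buf_(capacity)
    {
        refill();
    }

    bool empty() const { return pos_ == count_; }

    int peek() const { return buf_[pos_]; }   // look at the next int without consuming it

    int next()                                // consume the next int, refilling when needed
    {
        int v = buf_[pos_++];
        if (pos_ == count_)
            refill();
        return v;
    }
};

// merge loop sketch (output.write as in the pseudocode above):
//     BufferedIntReader a("f1.bin", 25 * 1024 * 1024);  // 100 MB of ints each
//     BufferedIntReader b("f2.bin", 25 * 1024 * 1024);
//     while (!a.empty() && !b.empty())
//         output.write(a.peek() < b.peek() ? a.next() : b.next());
//     // then drain whichever reader is not yet empty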

Can you suggest improvement for that?

Thanks.

ledokol asked Dec 29 '10

3 Answers

Without getting into the discussion on stream buffers, you can get rid of the seek_back and generally make the code much simpler by doing:

using namespace std;
merge(istream_iterator<int>(file1), istream_iterator<int>(),
      istream_iterator<int>(file2), istream_iterator<int>(),
      ostream_iterator<int>(cout, "\n"));   // "\n" keeps the numbers separated

Edit:

Added binary capability

#include <algorithm>
#include <iterator>
#include <fstream>
#include <iostream>

struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value),sizeof(int));
    }
};

int main()
{
    // open in binary mode, since we read raw bytes rather than formatted text
    std::ifstream   file1("f1.txt", std::ios::binary);
    std::ifstream   file2("f2.txt", std::ios::binary);

    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<int>(std::cout, "\n"));  // output is still formatted text
}
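If the merged result should be binary as well, the same trick presumably works on the output side; a sketch (not part of the original answer; the output file name is arbitrary) giving BinInt a matching inserter:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>

struct BinInt
{
    int value;
    operator int() const { return value; }
    friend std::istream& operator>>(std::istream& stream, BinInt& data)
    {
        return stream.read(reinterpret_cast<char*>(&data.value), sizeof(int));
    }
    // matching inserter: write the raw bytes instead of formatted text
    friend std::ostream& operator<<(std::ostream& stream, const BinInt& data)
    {
        return stream.write(reinterpret_cast<const char*>(&data.value), sizeof(int));
    }
};

int main()
{
    std::ifstream file1("f1.txt", std::ios::binary);
    std::ifstream file2("f2.txt", std::ios::binary);
    std::ofstream out("merged.bin", std::ios::binary);  // arbitrary output name

    std::merge(std::istream_iterator<BinInt>(file1), std::istream_iterator<BinInt>(),
               std::istream_iterator<BinInt>(file2), std::istream_iterator<BinInt>(),
               std::ostream_iterator<BinInt>(out));
}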
davka answered Oct 17 '22

In decreasing order of performance (best first):

  • memory-mapped I/O
  • OS-specific ReadFile (Windows) or read (POSIX) calls
  • fread into a large buffer
  • ifstream.read into a large buffer
  • ifstream and extractors
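For illustration, a minimal sketch of the "ifstream.read into a large buffer" option; the 100 MB size, the file name, and the summing loop are arbitrary stand-ins for real processing:

#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::ifstream in("f1.bin", std::ios::binary);
    std::vector<char> buf(100 * 1024 * 1024);   // 100 MB chunk, size is arbitrary

    long long sum = 0;                          // stand-in for real processing
    while (in)
    {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();      // may be short on the last chunk

        for (std::streamsize i = 0;
             i + static_cast<std::streamsize>(sizeof(int)) <= got;
             i += sizeof(int))
        {
            int value;
            std::memcpy(&value, buf.data() + i, sizeof(int)); // avoids alignment issues
            sum += value;
        }
    }
    std::cout << sum << '\n';
}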
Ben Voigt answered Oct 17 '22


A program like this should be I/O bound, meaning it should spend at least 80% of its time waiting for a read or write of a buffer to complete, and if the buffers are reasonably big, it should be keeping the disk heads busy. That's what you want.

Don't assume it is I/O bound without proof. A way to prove it is to take several stackshots. If it is I/O bound, most of the samples will show the program waiting for I/O completion.

It is possible that it is not I/O bound, meaning the samples may catch things going on that you never expected. If so, then you know exactly what to fix to speed it up. For example, I have seen code like this spend much more time than necessary in the merge loop itself, testing for end-of-file, getting data to compare, and so on.

Mike Dunlavey answered Oct 17 '22