My program reads dozens of very large files in parallel, just one line at a time. It seems like the major performance bottleneck is HDD seek time from file to file (though I'm not completely sure how to verify this), so I think it would be faster if I could buffer the input.
I'm using C++ code like this to read my files through boost::iostreams "filtering streams":
input = new filtering_istream;
input->push(gzip_decompressor());
file_source in (fname);
input->push(in);
According to the documentation, file_source
does not have any way to set the buffer size but filtering_stream::push seems to:
void push( const T& t,
std::streamsize buffer_size,
std::streamsize pback_size );
So I tried input->push(in, 1E9)
and indeed my program's memory usage shot up, but the speed didn't change at all.
Was I simply wrong that read buffering would improve performance? Or did I do this wrong? Can I buffer a file_source directly, or do I need to create a filtering_streambuf? If the latter, how does that work? The documentation isn't exactly full of examples.
You should profile it too see where the bottleneck is.
Perhaps it's in the kernel, perhaps your at your hardware's limit. Until you profile it to find out you're stumbling in the dark.
EDIT:
Ok, a more thorough answer this time, then. According to the Boost.Iostreams documentation basic_file_source
is just a wrapper around std::filebuf
, which in turn is built on std::streambuf
. To quote the documentation:
CopyConstructible and Assignable wrapper for a std::basic_filebuf opened in read-only mode.
streambuf
does provide a method pubsetbuf (not the best reference perhaps, but the first google turned up) which you can, apparently, use to control the buffer size.
For example:
#include <fstream>
int main()
{
char buf[4096];
std::ifstream f;
f.rdbuf()->pubsetbuf(buf, 4096);
f.open("/tmp/large_file", std::ios::binary);
while( !f.eof() )
{
char rbuf[1024];
f.read(rbuf, 1024);
}
return 0;
}
In my test (optimizations off, though) I actually got worse performance with a 4096 bytes buffer than a 16 bytes buffer but YMMV -- a good example of why you should always profile first :)
But, as you say, the basic_file_sink
does not provide any means to access this as it hides the underlying filebuf
in its private part.
If you think this is wrong you could:
filebuf
wrapper which does expose the buffer size. There's a section in the tutorial which explains writing custom sources that might be a good starting point.Remember that your hard drive as well as the kernel already does caching and buffering on file reads, which I don't think that you'll get much of a performance increase from caching even more.
And in closing, a word on profiling. There's a ton of powerful profiling tools available for Linux an I don't even know half of them by name, but for example there's iotop which is kind of neat because it's super simple to use. It's pretty much like top but instead shows disk related metrics. For example:
Total DISK READ: 31.23 M/s | Total DISK WRITE: 109.36 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
19502 be/4 staffan 31.23 M/s 0.00 B/s 0.00 % 91.93 % ./apa
tells me that my progam spends over 90% of its time waiting for IO, i.e. it's IO bound. If you need something more powerful I'm sure google can help you.
And remember that benchmarking on a hot or cold cache greatly affects the outcome.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With