How to change buffer size with boost::iostreams?

Tags: c++, boost

My program reads dozens of very large files in parallel, just one line at a time. It seems like the major performance bottleneck is HDD seek time from file to file (though I'm not completely sure how to verify this), so I think it would be faster if I could buffer the input.

I'm using C++ code like this to read my files through boost::iostreams "filtering streams":

#include <boost/iostreams/filtering_stream.hpp>  // filtering_istream
#include <boost/iostreams/filter/gzip.hpp>       // gzip_decompressor
#include <boost/iostreams/device/file.hpp>       // file_source
using namespace boost::iostreams;
input = new filtering_istream;
input->push(gzip_decompressor());
file_source in(fname);
input->push(in);

According to the documentation, file_source does not have any way to set the buffer size, but filtering_stream::push seems to:

void push( const T& t,
  std::streamsize buffer_size,
  std::streamsize pback_size );

So I tried input->push(in, 1E9) and indeed my program's memory usage shot up, but the speed didn't change at all.

Was I simply wrong that read buffering would improve performance? Or did I do this wrong? Can I buffer a file_source directly, or do I need to create a filtering_streambuf? If the latter, how does that work? The documentation isn't exactly full of examples.

asked Jul 08 '10 by user387250
1 Answer

You should profile it to see where the bottleneck is.

Perhaps it's in the kernel, perhaps you're at your hardware's limit. Until you profile it to find out, you're stumbling in the dark.

EDIT:

OK, a more thorough answer this time, then. According to the Boost.Iostreams documentation, basic_file_source is just a wrapper around std::filebuf, which in turn is built on std::streambuf. To quote the documentation:

CopyConstructible and Assignable wrapper for a std::basic_filebuf opened in read-only mode.

streambuf does provide a method pubsetbuf (not the best reference perhaps, but the first one Google turned up) which you can, apparently, use to control the buffer size.

For example:

#include <fstream>

int main()
{
  char buf[4096];
  std::ifstream f;
  // pubsetbuf() should be called before open() for the new buffer to take effect.
  f.rdbuf()->pubsetbuf(buf, sizeof(buf));
  f.open("/tmp/large_file", std::ios::binary);

  // Checking read()'s result (plus gcount() for the last partial chunk)
  // avoids the classic "while( !f.eof() )" off-by-one.
  char rbuf[1024];
  while( f.read(rbuf, sizeof(rbuf)) || f.gcount() > 0 )
  {
      // process f.gcount() bytes of rbuf here
  }

  return 0;
}

In my test (optimizations off, though) I actually got worse performance with a 4096-byte buffer than with a 16-byte buffer, but YMMV -- a good example of why you should always profile first :)

But, as you say, basic_file_source does not provide any means to access this, as it hides the underlying filebuf in its private parts.

If you think this is wrong you could:

  1. Urge the Boost developers to expose such functionality; use the mailing list or the Trac.
  2. Build your own filebuf wrapper which does expose the buffer size. There's a section in the tutorial which explains writing custom sources, which might be a good starting point.
  3. Write a custom source, based on whatever you like, that does all the caching you fancy (a minimal sketch follows below).
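
For options 2 and 3, here's a rough sketch of what such a custom Source could look like. The name big_buffer_source, the setvbuf-based stdio buffering and the 1 MiB default are just illustration, not a polished implementation:

#include <cstdio>
#include <ios>        // std::streamsize
#include <stdexcept>
#include <boost/shared_ptr.hpp>
#include <boost/iostreams/concepts.hpp>   // boost::iostreams::source

// A Source that reads through stdio with a big buffer attached via
// setvbuf(). Devices must be copyable, hence the shared_ptr.
class big_buffer_source : public boost::iostreams::source {
public:
    explicit big_buffer_source(const char* path, std::size_t bufsize = 1 << 20)
    {
        std::FILE* f = std::fopen(path, "rb");
        if (!f)
            throw std::runtime_error("cannot open file");
        file_.reset(f, std::fclose);
        std::setvbuf(f, NULL, _IOFBF, bufsize);  // let stdio allocate a bufsize buffer
    }

    std::streamsize read(char* s, std::streamsize n)
    {
        std::size_t got = std::fread(s, 1, static_cast<std::size_t>(n), file_.get());
        return got > 0 ? static_cast<std::streamsize>(got) : -1;  // -1 signals EOF
    }

private:
    boost::shared_ptr<std::FILE> file_;
};

You would then push it into the filtering_istream exactly like file_source:

input->push(gzip_decompressor());
input->push(big_buffer_source(fname));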

Remember that your hard drive, as well as the kernel, already does caching and buffering of file reads, so I don't think you'll get much of a performance increase from caching even more.
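
If you do want to nudge the kernel's readahead rather than buffer in user space yourself, something along these lines is what I'd try on Linux. A rough sketch only; the path and the plain read() loop are just for illustration:

#include <fcntl.h>    // open, posix_fadvise, POSIX_FADV_SEQUENTIAL
#include <unistd.h>   // read, close

int main()
{
    int fd = open("/tmp/large_file", O_RDONLY);
    if (fd == -1)
        return 1;

    // Hint that we'll read sequentially so the kernel can read ahead
    // more aggressively; offset 0, length 0 means "the whole file".
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    while (read(fd, buf, sizeof(buf)) > 0)
        ;  // process the chunk here

    close(fd);
    return 0;
}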

And in closing, a word on profiling. There's a ton of powerful profiling tools available for Linux, and I don't even know half of them by name, but for example there's iotop, which is kind of neat because it's super simple to use. It's pretty much like top, but it shows disk-related metrics instead. For example:

Total DISK READ: 31.23 M/s | Total DISK WRITE: 109.36 K/s
TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND          
19502 be/4 staffan    31.23 M/s    0.00 B/s  0.00 % 91.93 % ./apa

tells me that my program spends over 90% of its time waiting for IO, i.e. it's IO bound. If you need something more powerful, I'm sure Google can help you.

And remember that benchmarking on a hot or cold cache greatly affects the outcome.

answered Nov 04 '22 by Staffan