Fancy way to read a file in C++: strange performance issue

Question

The usual way to read a file in C++ is this one:

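#include <fstream>
#include <vector>

// Open at the end (std::ios::ate) so that tellg() reports the file size.
std::ifstream file("file.txt", std::ios::binary | std::ios::ate);
std::vector<char> data(file.tellg());
file.seekg(0, std::ios::beg);
file.read(data.data(), data.size());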

Reading a 1.6 MB file is almost instant.

But recently, I discovered std::istream_iterator and wanted to try it in order to code a beautiful one-line way to read the content of a file. Like this:

std::vector<char> data(std::istream_iterator<char>(std::ifstream("file.txt", std::ios::binary)), std::istream_iterator<char>());

The code is nice, but very slow. It takes about 2 to 3 seconds to read the same 1.6 MB file. I understand that it may not be the best way to read a file, but why is it so slow?

Reading a file in a classical way goes like this (I'm talking only about the read function):

1. The istream contains a filebuf, which holds a block of data from the file.
2. The read function calls sgetn on the filebuf, which copies the chars one by one (no memcpy) from the internal buffer to data's buffer.
3. When the data inside the filebuf has been entirely read, the filebuf reads the next block from the file.

When you read a file using istream_iterator, it goes like this:

1. The vector calls *iterator to get the next char (this simply reads a variable), appends it, and increases its own size.
2. If the vector's allocated space is full (which does not happen often), a reallocation is performed.
3. Then it calls ++iterator, which reads the next char from the stream (operator>> with a char parameter, which presumably just calls the filebuf's sbumpc function).
4. Finally, it compares the iterator with the end iterator, which is done by comparing two pointers.
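
The steps listed above can be written out as an explicit loop. This is only a sketch: the function name copy_with_explicit_steps is hypothetical, and whether the first character is extracted in the iterator's constructor or on first use has varied between standard library versions.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Hypothetical helper: spells out what constructing a vector from a pair of
// istream_iterators does, one step per line.
std::vector<char> copy_with_explicit_steps(std::istream& in) {
    std::vector<char> out;
    std::istream_iterator<char> it(in);     // bound to the stream
    const std::istream_iterator<char> end;  // default-constructed: end of stream
    while (it != end) {       // compare with the end iterator
        out.push_back(*it);   // *it returns the character the iterator holds
        ++it;                 // performs the next formatted extraction (in >> c)
    }
    return out;
}
```

Any std::istream can be passed in, for example a std::istringstream for experiments or the std::ifstream from the question.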

I must admit that the second way is not very efficient, but it's at least 200 times slower than the first way. How is that possible?

I thought that the performance killer was the reallocations or the insertions, but I tried preallocating the entire vector and calling std::copy, and it's just as slow.

// also very slow:
std::vector<char> data2(1730608);
std::copy(std::istream_iterator<char>(std::ifstream("file.txt", std::ios::binary)), std::istream_iterator<char>(), data2.begin());

Thomas Petit · Accepted Answer

You should compare apples to apples.

Your first snippet reads unformatted binary data because it uses the member function read(), not because it opens the file with std::ios::binary. See http://stdcxx.apache.org/doc/stdlibug/30-4.html for a fuller explanation, but in short: "The effect of the binary open mode is frequently misunderstood. It does not put the inserters and extractors into a binary mode, and hence suppress the formatting they usually perform. Binary input and output is done solely by basic_istream<>::read() and basic_ostream<>::write()".

So your second snippet, with istream_iterator, reads formatted text, which is much slower.

If you want to read unformatted binary data, use istreambuf_iterator:

#include <fstream>
#include <vector>
#include <iterator>

std::ifstream file( "file.txt", std::ios::binary);
std::vector<char> buffer((std::istreambuf_iterator<char>(file)),
                          std::istreambuf_iterator<char>());
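
The difference between the two iterator types is also visible in the data they return, not only in speed. Below is a minimal sketch, assuming a std::istringstream as a stand-in for the file; the helper names read_formatted and read_unformatted are hypothetical. Formatted extraction with operator>> skips whitespace by default, so istream_iterator<char> silently drops spaces and newlines, while istreambuf_iterator<char> returns every character.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Formatted path: istream_iterator<char> uses operator>>, which skips
// whitespace by default, so spaces and newlines never reach the vector.
std::vector<char> read_formatted(const std::string& text) {
    std::istringstream in(text);
    return std::vector<char>(std::istream_iterator<char>(in),
                             std::istream_iterator<char>());
}

// Unformatted path: istreambuf_iterator<char> pulls characters straight
// from the stream buffer, so nothing is skipped.
std::vector<char> read_unformatted(const std::string& text) {
    std::istringstream in(text);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

For the five-character input "a b\nc", read_formatted returns the three characters a, b, c, whereas read_unformatted returns all five.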

On my platform (VS2008), istream_iterator is about 100 times slower than read(). istreambuf_iterator performs better, but is still about 10 times slower than read().
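
To reproduce this kind of comparison on another platform, a small timing harness is enough. This is a sketch only: elapsed_ms, read_bulk and read_streambuf_iter are hypothetical names, the file path is whatever you pass in, and a single run is a rough measurement at best because of disk caching.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <class F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Bulk read: one read() call for the whole file.
// Returns an empty vector if the file cannot be opened.
std::vector<char> read_bulk(const std::string& path) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return {};
    const std::streamoff size = file.tellg();
    std::vector<char> data(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    file.read(data.data(), static_cast<std::streamsize>(data.size()));
    return data;
}

// Per-character read through the stream buffer.
// Returns an empty vector if the file cannot be opened.
std::vector<char> read_streambuf_iter(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(file),
                             std::istreambuf_iterator<char>());
}
```

A call such as elapsed_ms([] { read_bulk("file.txt"); }) gives one timing; running each variant several times and discarding the first run helps reduce the influence of the OS file cache.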

Gianni · Answer

Only profiling will tell you why exactly. My guess would be that what you are seeing is just the overhead of all of the extra function calls associated with the second method. Instead of a single call to bring in all the data, you are doing 1.6M calls*... or something along those lines.

* Many of them are virtual, which means two CPU cycles per call. (Thanks, Zan.)
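
The call-count argument can be illustrated with a toy stream buffer that counts how often its virtual functions are invoked. This is a sketch under loud assumptions: CountingBuf, drain_per_char and drain_bulk are hypothetical names, and the buffer is deliberately unbuffered so that every character costs a virtual call. A real std::filebuf keeps an internal buffer, so its virtual calls happen mainly on refills, and the exact per-character count depends on the standard library implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <iterator>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical unbuffered stream buffer that counts virtual calls.
class CountingBuf : public std::streambuf {
public:
    explicit CountingBuf(std::string data) : data_(std::move(data)) {}
    std::size_t calls = 0;  // number of virtual functions invoked so far

protected:
    // Peek at the current character without consuming it.
    int_type underflow() override {
        ++calls;
        return pos_ < data_.size() ? traits_type::to_int_type(data_[pos_])
                                   : traits_type::eof();
    }
    // Consume and return the current character.
    int_type uflow() override {
        ++calls;
        return pos_ < data_.size() ? traits_type::to_int_type(data_[pos_++])
                                   : traits_type::eof();
    }
    // Bulk request: copy up to n characters in one call.
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        ++calls;
        const std::streamsize avail =
            static_cast<std::streamsize>(data_.size() - pos_);
        const std::streamsize k = std::min(n, avail);
        std::copy_n(data_.data() + pos_, k, s);
        pos_ += static_cast<std::size_t>(k);
        return k;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
};

// Drains the buffer one character at a time; returns the character count.
std::size_t drain_per_char(std::streambuf& buf) {
    std::size_t n = 0;
    for (std::istreambuf_iterator<char> it(&buf), end; it != end; ++it) ++n;
    return n;
}

// Reads up to out.size() characters with a single read() call.
std::size_t drain_bulk(std::streambuf& buf, std::vector<char>& out) {
    std::istream in(&buf);
    in.read(out.data(), static_cast<std::streamsize>(out.size()));
    return static_cast<std::size_t>(in.gcount());
}
```

After draining 1000 characters, the per-character path has made at least 1000 virtual calls on this toy buffer, while the bulk path has made one.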

Roddy · Answer

The iterator approach reads the file one character at a time, while file.read does it in a single hit.

If the operating system/file handlers know you want to read a large amount of data, there are lots of optimizations that can be done, such as reading the whole file in a single revolution of the disk spindle or avoiding copies from OS buffers to application buffers.

When you do byte-by-byte transfers, the OS has no clue what you're really wanting to do, so cannot perform such optimizations.

Tags: c++, performance, iterator, file

Tomaka17
