
Optimal way of reading a complete file to a string using fstream?

Tags: c++, file, input

Many other posts, like "Read whole ASCII file into C++ std::string", explain what some of the options are but do not describe the pros and cons of the various methods in any depth. I want to know why one method is preferable over another.

All of these use std::fstream to read the file into a std::string. I am unsure what the costs and benefits of each method are. Let's assume this is for the common case where the files being read are known to be of some smallish size that memory can easily accommodate; clearly reading a multi-terabyte file into memory is a bad idea no matter how you do it.

The most common way, after a few Google searches, to read a whole file into a std::string involves using std::getline and appending a newline character after each line. This seems needless to me, but is there some performance or compatibility reason that makes it ideal?

std::string Results;
std::string Line;
std::ifstream ResultReader("file.txt");
while(std::getline(ResultReader, Line))
{
    Results += Line;          // getline overwrites its argument, so accumulate through a temporary
    Results.push_back('\n');  // re-add the '\n' that getline strips
}

Another way I pieced together is to change the getline delimiter to something that is not in the file. The EOF character seems unlikely to appear in the middle of a file, so that seems a likely candidate. This requires a cast, so there is at least one reason not to do it, but it does read the file all at once with no string concatenation. Presumably there is still some cost for the delimiter checks. Are there any other good reasons not to do this?

std::string Results;
std::ifstream ResultReader("file.txt");
std::getline(ResultReader, Results, (char)std::char_traits<char>::eof());

The cast means that systems that define std::char_traits<char>::eof() as something other than -1 might have problems. Is this a practical reason not to choose this over other methods that use std::getline and string::push_back('\n')?

How do these compare to other ways of reading the file at once, like in this question: Read whole ASCII file into C++ std::string

std::ifstream ResultReader("file.txt");
std::string Results((std::istreambuf_iterator<char>(ResultReader)),
                     std::istreambuf_iterator<char>());

It would seem this would be best. It offloads almost all the work onto the standard library, which ought to be heavily optimized for the given platform. I see no reason for checks other than stream validity and the end of the file. Is this ideal, or are there problems with it that go unseen?

Does the standard or do details of some implementation provide reasons to prefer some method over another? Have I missed some method that might prove ideal in a wide variety of circumstances?

What is the simplest, most idiomatic, best performing and standard compliant way of reading a whole file into an std::string?

EDIT 2 - This question has prompted me to write a small suite of benchmarks. They are MIT licensed and available on GitHub at: https://github.com/Sqeaky/CppFileToStringExperiments

Fastest - TellSeekRead and CTellSeekRead - These have the system provide an easy way to get the size, then read the file in one go.

Faster - Getline Appending and Eof - The checking of chars does not seem to impose any cost.

Fast - RdbufMove and Rdbuf - The std::move seems to make no difference in release.

Slow - Iterator, BackInsertIterator and AssignIterator - Something is wrong with iterators and input streams. They work great in memory, but not here. That said, some of these are faster than others; a sketch of the back-inserter variant follows this list.
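
For reference, here is a minimal sketch of what I mean by the BackInsertIterator variant (the exact benchmark code is in the repository linked above; the function name here is illustrative):

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Copy the raw stream buffer into the string one character at a time
// through a back-insert output iterator.
std::string via_back_inserter(std::ifstream& in)
{
    std::string s;
    std::copy(std::istreambuf_iterator<char>(in),
              std::istreambuf_iterator<char>(),
              std::back_inserter(s));
    return s;
}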

I have added every method suggested so far, including those in the links. I would appreciate it if someone could run this on Windows and with other compilers. I currently do not have access to a machine with NTFS, and it has been noted that this and compiler details could be important.

As for measuring simplicity and idiomatic-ness, how do we measure these objectively? Simplicity seems doable, perhaps using something like LOC and cyclomatic complexity, but how idiomatic something is seems purely subjective.

asked Aug 23 '15 by Sqeaky


3 Answers

What is the simplest, most idiomatic, best performing and standard compliant way of reading a whole file into an std::string?

Those are pretty much contradictory requests; one is likely to come at the expense of another. Simpler code won't be the fastest, or the most idiomatic.

After exploring this area for a while I've come to some conclusions:
1) The biggest performance penalty is the IO action itself - the fewer IO actions taken, the faster the code.
2) Memory allocations are also quite expensive, but not as expensive as the IO.
3) Reading as binary is faster than reading as text.
4) Using the OS API will probably be faster than C++ streams.
5) std::ios_base::sync_with_stdio doesn't really affect the performance; it's an urban legend.

Using std::getline is probably not the best choice if performance is needed, because it will make N IO actions and N allocations for N lines.

A compromise which is fast, standard and elegant is to get the file size, allocate all the memory in one go, then read the file in a single call:

std::ifstream fileReader(<your path here>, std::ios::binary | std::ios::ate);
if (fileReader){
    auto fileSize = fileReader.tellg();  // already at the end thanks to std::ios::ate
    fileReader.seekg(0, std::ios::beg);  // rewind to the beginning
    std::string content(static_cast<std::size_t>(fileSize), '\0');
    fileReader.read(&content[0], fileSize);
}

Move the content around to prevent unneeded copies.
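
For example, a minimal sketch (the slurp name and exact structure are illustrative, not from the benchmarks) that returns the string so the caller receives it via move or copy elision rather than a copy:

#include <fstream>
#include <string>

// Returning the local string lets NRVO / move semantics hand the
// contents to the caller without an extra copy.
std::string slurp(const std::string& path)
{
    std::ifstream fileReader(path, std::ios::binary | std::ios::ate);
    std::string content;
    if (fileReader) {
        auto fileSize = fileReader.tellg();
        fileReader.seekg(0, std::ios::beg);
        content.resize(static_cast<std::size_t>(fileSize));
        fileReader.read(&content[0], fileSize);
    }
    return content; // moved or elided, not copied
}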

answered Oct 05 '22 by David Haim


This website has a good comparison of several different methods for doing that. The one I currently use is:

#include <fstream>
#include <sstream>
#include <string>

std::string read_sequence() {
    std::ifstream f("sequence.fasta");
    std::ostringstream ss;
    ss << f.rdbuf();  // slurp the whole stream buffer in one expression
    return ss.str();
}

If your text files are separated by newlines, this will keep them. If you want to remove them, for instance (which is my case most of the time), you can just add a call to something such as:

#include <algorithm>

auto s = ss.str();
s.erase(std::remove_if(s.begin(), s.end(),
        [](char c) { return c == '\n'; }), s.end());
answered Oct 05 '22 by LLLL


There are two big difficulties with your question. First, the Standard doesn't mandate any particular implementation (yes, nearly everybody started with the same implementation; but they've been modifying it over time, and the optimal I/O code for NTFS, say, will be different than the optimal I/O code for ext4), so it is possible (although somewhat unlikely) for a particular approach to be fastest on one platform, but not another. Second, there's a little difficulty in defining "optimal"; I assume you mean "fastest," but that's not necessarily the case.

There are approaches that are idiomatic, and perfectly fine C++, but unlikely to give wonderful performance. If your goal is to end up with a single std::string, using std::getline(std::istream&, std::string&) is very likely to be slower than necessary. The std::getline() call has to look for the '\n', and you'll occasionally reallocate and copy the destination std::string. Even so, it's ridiculously simple, and easy to understand. That could be optimal from a maintenance perspective, assuming you don't need the absolute fastest performance possible. This will also be a good approach if you don't need the whole file in one giant std::string at one time. You'll be very frugal with memory.
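
A sketch of that line-at-a-time pattern (illustrative only; the per-line work is a stand-in):

#include <fstream>
#include <string>

// Only one line is resident in memory at a time, which keeps this
// frugal even for large files.
void for_each_line(std::istream& in)
{
    std::string line;
    while (std::getline(in, line)) {
        // ... do per-line work here ...
    }
}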

An approach that is likely more efficient is to manipulate the read buffer:

#include <sstream>
#include <string>

std::string read_the_whole_file(std::istream& istr)
{
    std::ostringstream sstr;
    sstr << istr.rdbuf();
    return sstr.str();
}

Personally, I'm just as likely to use std::fopen() and std::fread() (and std::unique_ptr<FILE>) because, on Windows at least, you'll get a better error message when std::fopen() fails than when constructing a file stream object fails. I consider the better error message an important factor when deciding which approach is optimal.
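
A sketch of that approach, assuming a small custom deleter so the FILE* is closed automatically (the names here are illustrative, not from the answer):

#include <cstdio>
#include <memory>
#include <string>

// Close the FILE* when the unique_ptr goes out of scope.
struct FileCloser {
    void operator()(std::FILE* f) const { std::fclose(f); }
};

std::string read_with_stdio(const char* path)
{
    std::unique_ptr<std::FILE, FileCloser> file(std::fopen(path, "rb"));
    std::string contents;
    if (!file) {
        return contents; // real code would inspect errno here for that better error message
    }
    char buffer[4096];
    std::size_t n;
    while ((n = std::fread(buffer, 1, sizeof buffer, file.get())) > 0) {
        contents.append(buffer, n);
    }
    return contents;
}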

answered Oct 05 '22 by Max Lybbert