
What's the real reason to not use the EOF bit as our stream extraction condition?

Inspired by my previous question

A common mistake for new C++ programmers is to read from a file with something along the lines of:

std::ifstream file("foo.txt");
std::string line;
while (!file.eof()) {
  file >> line;
  // Do something with line
}

They will often report that the last line of the file was read twice. The common explanation for this problem (one that I have given before) goes something like:

The extraction will only set the EOF bit on the stream if you attempt to extract the end-of-file, not if your extraction just stops at the end-of-file. file.eof() will only tell you if the previous read hit the end-of-file and not if the next one will. After the last line has been extracted, the EOF bit is still not set and the iteration occurs one more time. However, on this last iteration, the extraction fails and line still has the same content as before, i.e. the last line is duplicated.

However, the first sentence of this explanation is wrong and so the explanation of what the code is doing is also wrong.

The definition of formatted input functions (which operator>>(std::string&) is) defines extraction as using rdbuf()->sbumpc() or rdbuf()->sgetc() to obtain input characters. It states that if either of these functions returns traits::eof(), then the EOF bit is set:

If rdbuf()->sbumpc() or rdbuf()->sgetc() returns traits::eof(), then the input function, except as explicitly noted otherwise, completes its actions and does setstate(eofbit), which may throw ios_base::failure (27.5.5.4), before returning.

We can see this with the simple example that uses a std::stringstream rather than a file (they are both input streams and behave the same way when extracting):

#include <iostream>
#include <sstream>
#include <string>

int main(int argc, const char* argv[])
{
  std::stringstream ss("hello");
  std::string result;
  ss >> result;
  std::cout << ss.eof() << std::endl; // Outputs 1
  return 0;
}

It's clear here that the single extraction obtains hello from the string and sets the EOF bit to 1.

So what's wrong with the explanation? What's different about files that causes !file.eof() to cause the last line to be duplicated? What's the real reason we shouldn't use !file.eof() as our extraction condition?

Joseph Mansfield asked Jan 30 '13 23:01


2 Answers

Yes, extracting from an input stream will set the EOF bit if the extraction stops at the end-of-file, as demonstrated by the std::stringstream example. If it were this simple, the loop with !file.eof() as its condition would work just fine on a file like:

hello
world

The second extraction would eat world, stopping at the end-of-file, and consequently setting the EOF bit. The next iteration wouldn't occur.

However, many text editors have a dirty secret. They're lying to you when you save a text file even as simple as that. What they don't tell you is that there's a hidden \n at the end of the file. Every line in the file ends with a \n, including the last one. So the file actually contains:

hello\nworld\n

This is what causes the last line to be duplicated when using !file.eof() as the condition. Now that we know this, we can see that the second extraction will eat world stopping at \n and not setting the EOF bit (because we haven't gotten there yet). The loop will iterate for a third time but the next extraction will fail because it doesn't find a string to extract, only whitespace. The string is left with its previous value still hanging around and so we get the duplicated line.

You don't experience this with std::stringstream because what you stick in the stream is exactly what you get. There's no \n at the end of std::stringstream ss("hello"), unlike in the file. If you were to do std::stringstream ss("hello\n"), you'd experience the same duplicate line issue.
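The duplicate can be reproduced without a file at all. Here is a minimal sketch, assuming a hypothetical helper named read_with_eof_loop (it is not part of the question) that runs the flawed loop over an in-memory string and records what each iteration "reads":

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration: collects tokens using the flawed
// !eof() loop condition from the question.
std::vector<std::string> read_with_eof_loop(const std::string& contents) {
    std::istringstream in(contents);
    std::vector<std::string> tokens;
    std::string token;
    while (!in.eof()) {
        in >> token;              // may fail, leaving token unchanged
        tokens.push_back(token);  // so the stale value is stored again
    }
    return tokens;
}
```

With "hello\nworld" the helper returns two tokens. With "hello\nworld\n" it returns three, the last being a second copy of world.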

So of course, we can see that we should never use !file.eof() as the condition when extracting from a text file - but what's the real issue here? Why should we really never use that as our condition, regardless of whether we're extracting from a file or not?

The real problem is that eof() gives us no idea whether the next read will fail or not. In the above case, we saw that even though eof() was 0, the next extraction failed because there was no string to extract. The same situation would happen if we didn't associate a file stream with any file or if the stream was empty. The EOF bit wouldn't be set but there's nothing to read. We can't just blindly go ahead and extract from the file just because eof() isn't set.
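As a rough illustration of that point, the sketch below records eof() just before a single extraction and whether the extraction then succeeded. The ReadAttempt struct and the try_extract function are names invented for this example.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record of one extraction attempt.
struct ReadAttempt {
    bool eof_before;  // value of eof() just before extracting
    bool succeeded;   // whether the extraction itself succeeded
};

// Hypothetical helper: tries to extract one word from the stream.
ReadAttempt try_extract(std::istream& in) {
    ReadAttempt attempt;
    attempt.eof_before = in.eof();
    std::string word;
    attempt.succeeded = static_cast<bool>(in >> word);
    return attempt;
}
```

For an empty std::istringstream and for a default-constructed std::ifstream that was never opened, eof_before is false and yet succeeded is also false, so eof() alone says nothing useful about the upcoming read.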

Using while (std::getline(...)) and related conditions works because the loop tests the result of each extraction rather than guessing beforehand whether it will succeed. Just before the extraction starts, the input function checks whether any of the bad, fail, or EOF bits are set. If any of them are, it returns immediately, setting the fail bit in the process. It also fails if it finds the end-of-file before it finds what it wants to extract, setting both the EOF and fail bits. In either case the stream then converts to false and the loop ends.
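A small sketch of the idiomatic loops, assuming two hypothetical helpers, read_lines and read_tokens, that take any std::istream:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: the loop body runs only if getline succeeded.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: same idea for whitespace-separated tokens.
std::vector<std::string> read_tokens(std::istream& in) {
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Both helpers return hello and world exactly once for "hello\nworld\n" and for "hello\nworld", and return nothing for an empty stream.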


Note: You can save a file without the extra \n in vim if you do :set noeol and :set binary before saving.

Joseph Mansfield answered Oct 15 '22 09:10


Your question has some bogus conceptions. You give an explanation:

"The extraction will only set the EOF bit on the stream if you attempt to extract the end-of-file, not if your extraction just stops at the end-of-file."

Then claim it "is wrong and so the explanation of what the code is doing is also wrong."

Actually, it's right. Let's look at an example....

When reading into a std::string...

std::istringstream iss("abc\n");
std::string my_string;
iss >> my_string;

...by default and as in your question operator>> is reading characters until it finds whitespace or EOF. So:

  • reading from 'abc\n' -> once the '\n' is encountered it doesn't "attempt to extract the end-of-file", rather it "just stops at [EOF]", and eof() won't return true,
  • reading from 'abc' instead -> it's the attempt to extract the end-of-file that discovers the end of the string content, so eof() will return true.

Similarly, parsing '123' into an int sets eof() because the parsing doesn't know if there will be another digit and tries to keep reading them, hitting eof(). Parsing '123 ' to an int won't set eof().

Crucially, parsing 'a' into a char won't set eof() because trailing whitespace isn't needed to know that the parsing is complete - once a character is read no attempt is made to find another character and the eof() isn't encountered. (Of course further parsing from the same stream hits eof).
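These cases can be checked with a short sketch. The function template eof_after_extracting is a name made up for this example. It performs one extraction of type T from a string and reports eof() afterwards.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: extract one value of type T from contents and
// report whether the EOF bit is set afterwards.
template <typename T>
bool eof_after_extracting(const std::string& contents) {
    std::istringstream in(contents);
    T value{};
    in >> value;
    return in.eof();
}
```

Called as eof_after_extracting<std::string>("abc\n") it returns false, while "abc" gives true. For int, "123" gives true and "123 " gives false. For char, "a" gives false.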

"It's clear [for stringstream "hello" >> std::string] that the single extraction obtains hello from the string and sets the EOF bit to 1. So what's wrong with the explanation? What's different about files that causes !file.eof() to cause the last line to be duplicated? What's the real reason we shouldn't use !file.eof() as our extraction condition?"

The reason is as above: files tend to be terminated by a '\n' character, and when they are, getline or >> std::string returns the last non-whitespace token without needing to "attempt to extract the end-of-file" (to use your phrase).

Tony Delroy answered Oct 15 '22 10:10