I've always assumed that, when processing text files, it's more efficient to first read the contents (or part of them) into an std::string or char array, since, from my limited understanding, files are read from disk in blocks much larger than a single character. However, I've heard that modern OSes often aren't actually reading directly from the file anyway, which would make my manually buffering the input of little benefit.
Say I wanted to count the occurrences of a certain character in a text file. Would the following be inefficient?
while (fin.get(ch)) {
    if (ch == 'n')
        ++char_count;
}
Granted, I guess it depends on file size, but does anyone have any general rules of thumb about the best approach?
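For reference, the manually buffered version I had in mind would be something along these lines (the 4096-byte chunk size is just a placeholder, not anything I've tuned):
#include <istream>

unsigned count_buffered(std::istream &fin, char c) {
    char buffer[4096];                  // arbitrary chunk size
    unsigned char_count = 0;
    std::streamsize n;
    // read() sets eofbit/failbit on the last, partial chunk, but gcount()
    // still reports how many characters it actually delivered
    while ((n = fin.read(buffer, sizeof buffer).gcount()) > 0) {
        for (std::streamsize i = 0; i < n; ++i)
            if (buffer[i] == c)
                ++char_count;
    }
    return char_count;
}
Is the extra bookkeeping worth it, or does the library's own buffering already give me most of the benefit?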
A great deal here depends on exactly how critical performance really is for you and your application. That, in turn, tends to depend on how large the files you're dealing with are -- if it's something like tens or hundreds of kilobytes, you should generally just write the simplest code that works and not worry much about it; anything you do is going to be essentially instantaneous, so optimizing the code won't really accomplish much.
On the other hand, if you're processing a lot of data -- on the order of tens of megabytes or more -- differences in efficiency can become fairly substantial. Unless you take fairly specific steps to bypass it (such as using read()), all your reads are going to be buffered -- but that does not mean they'll all be the same speed (or necessarily even very close to the same speed).
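For concreteness, bypassing the library-level buffer might look roughly like this on a POSIX system (this is just a sketch of what I mean by "using read"; it skips the stdio/iostream buffer, though the OS's own page cache is still in play, and the equivalent calls differ on Windows):
#include <fcntl.h>      // open
#include <unistd.h>     // read, close

unsigned count_unbuffered(const char *name, char c) {
    unsigned count = 0;
    int fd = open(name, O_RDONLY);
    if (fd < 0)
        return 0;                       // error handling kept minimal
    char buffer[4096];
    ssize_t n;
    // read() fills our buffer directly from the file descriptor,
    // with no library-level buffering in between
    while ((n = read(fd, buffer, sizeof buffer)) > 0)
        for (ssize_t i = 0; i < n; ++i)
            if (buffer[i] == c)
                ++count;
    close(fd);
    return count;
}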
For example, let's try a quick test of a few different methods for doing essentially what you've asked about:
#include <stdio.h>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <fstream>
#include <time.h>
#include <string>
#include <algorithm>
// Read one character at a time with getc() and count the matches.
unsigned count1(FILE *infile, char c) {
    int ch;
    unsigned count = 0;
    while (EOF != (ch = getc(infile)))
        if (ch == c)
            ++count;
    return count;
}
// Read the file in 4K chunks with fread() and scan each chunk.
unsigned int count2(FILE *infile, char c) {
    static char buffer[4096];
    int size;
    unsigned int count = 0;
    while (0 < (size = fread(buffer, 1, sizeof(buffer), infile)))
        for (int i = 0; i < size; i++)
            if (buffer[i] == c)
                ++count;
    return count;
}
// Count using istreambuf_iterators, which read straight from the stream buffer.
unsigned count3(std::istream &infile, char c) {
    return std::count(std::istreambuf_iterator<char>(infile),
                      std::istreambuf_iterator<char>(), c);
}
// Count using istream_iterators, which go through formatted extraction.
unsigned count4(std::istream &infile, char c) {
    return std::count(std::istream_iterator<char>(infile),
                      std::istream_iterator<char>(), c);
}
// Time one counting function against one input and print the result.
template <class F, class T>
void timer(F f, T &t, std::string const &title) {
    unsigned count;
    clock_t start = clock();
    count = f(t, 'N');                  // count occurrences of 'N'
    clock_t stop = clock();
    std::cout << std::left << std::setw(30) << title << "\tCount: " << count;
    std::cout << "\tTime: " << double(stop - start) / CLOCKS_PER_SEC << "\n";
}
int main() {
    char const *name = "test input.txt";

    FILE *infile = fopen(name, "r");
    timer(count1, infile, "ignore");    // warm-up pass; its timing is labeled so it can be ignored
    rewind(infile);
    timer(count1, infile, "using getc");
    rewind(infile);
    timer(count2, infile, "using fread");
    fclose(infile);

    std::ifstream in2(name);
    in2.sync_with_stdio(false);
    timer(count3, in2, "ignore");       // likewise, discard this first timing
    in2.clear();
    in2.seekg(0);
    timer(count3, in2, "using streambuf iterators");
    in2.clear();
    in2.seekg(0);
    timer(count4, in2, "using stream iterators");
    return 0;
}
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:
ignore Count: 400000 Time: 2.08
using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257
ignore Count: 400000 Time: 0.607
using streambuf iterators Count: 400000 Time: 0.608
using stream iterators Count: 400000 Time: 5.136
Using the same input, but compiled with g++ 4.7.1:
ignore Count: 400000 Time: 0.359
using getc Count: 400000 Time: 0.339
using fread Count: 400000 Time: 0.243
ignore Count: 400000 Time: 0.697
using streambuf iterators Count: 400000 Time: 0.694
using stream iterators Count: 400000 Time: 1.612
So, even though all the reads are buffered, we're seeing a variation of about 8:1 with g++ and about 20:1 with VC++. Of course, I haven't tested (even close to) every possible way of reading the input. I doubt we'd see a much wider range of times even if we tested more reading techniques, but I could be wrong about that. Either way, we're seeing enough variation that, at least if you're processing a lot of data, you could well be justified in choosing one technique over another to improve processing speed.
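As one example of a technique I didn't time above: slurping the whole file into a std::string and counting in memory, which is essentially what the question proposed. A sketch (untimed, and note that memory use grows with file size):
#include <fstream>
#include <sstream>
#include <string>
#include <algorithm>

unsigned count5(const char *name, char c) {
    std::ifstream in(name, std::ios::binary);
    std::ostringstream ss;
    ss << in.rdbuf();                   // pull the entire file into memory
    std::string contents = ss.str();
    return static_cast<unsigned>(std::count(contents.begin(), contents.end(), c));
}
Its signature differs from the others (it takes a file name rather than an open stream), so it would need to be timed separately rather than passed to the timer above.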