Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Accessing individual characters in a file inefficient? (C++)

I've always assumed it to be more efficient, when processing text files, to first read the contents (or part of it) into an std::string or char array, as — from my limited understanding — files are read from memory in blocks much larger than the size of a single character. However, I've heard that modern OS's are often not actually directly reading from the file anyway, making my manually buffering the input little benefit.

Say I wanted to determine the number of a certain character in a text file. Would the following be inefficient?

while (fin.get(ch)) {
    if (ch == 'n')
        ++char_count;
}

Granted, I guess it would depend on file size, but does anyone have any general rules about what's the best approach?

like image 947
user1892721 Avatar asked Dec 10 '12 20:12

user1892721


1 Answers

A great deal here depends on exactly how critical performance really is for you/your application. That, in turn tends to depend upon how large of files you're dealing with -- if you're dealing with something like tens or hundreds of kilobytes, you should generally just write the simplest code that will work, and not worry much about it -- anything you can do is going to be essentially instantaneous, so optimizing the code won't really accomplish much.

On the other hand, if you're processing a lot of data -- on the order of tens of megabytes or more, differences in efficiency can become fairly substantial. Unless you take fairly specific steps to bypass it (such as using read) all your reads are going to be buffered -- but that does not mean they'll all be the same speed (or necessarily even very close to the same speed).

For example, let's try a quick test of a few different methods for doing essentially what you've asked about:

#include <stdio.h>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <fstream>
#include <time.h>
#include <string>
#include <algorithm>

unsigned count1(FILE *infile, char c) { 
    int ch;
    unsigned count = 0;

    while (EOF != (ch=getc(infile)))
        if (ch == c)
            ++count;
    return count;
}

unsigned int count2(FILE *infile, char c) { 
    static char buffer[4096];
    int size;
    unsigned int count = 0;

    while (0 < (size = fread(buffer, 1, sizeof(buffer), infile)))
        for (int i=0; i<size; i++)
            if (buffer[i] == c)
                ++count;
    return count;
}

unsigned count3(std::istream &infile, char c) {    
    return std::count(std::istreambuf_iterator<char>(infile), 
                    std::istreambuf_iterator<char>(), c);
}

unsigned count4(std::istream &infile, char c) {    
    return std::count(std::istream_iterator<char>(infile), 
                    std::istream_iterator<char>(), c);
}

template <class F, class T>
void timer(F f, T &t, std::string const &title) { 
    unsigned count;
    clock_t start = clock();
    count = f(t, 'N');
    clock_t stop = clock();
    std::cout << std::left << std::setw(30) << title << "\tCount: " << count;
    std::cout << "\tTime: " << double(stop-start)/CLOCKS_PER_SEC << "\n";
}

int main() {
    char const *name = "test input.txt";

    FILE *infile=fopen(name, "r");

    timer(count1, infile, "ignore");

    rewind(infile);
    timer(count1, infile, "using getc");

    rewind(infile);
    timer(count2, infile, "using fread");

    fclose(infile);

    std::ifstream in2(name);
    in2.sync_with_stdio(false);
    timer(count3, in2, "ignore");

    in2.clear();
    in2.seekg(0);
    timer(count3, in2, "using streambuf iterators");

    in2.clear();
    in2.seekg(0);
    timer(count4, in2, "using stream iterators");

    return 0;
}

I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:

ignore                          Count: 400000   Time: 2.08
using getc                      Count: 400000   Time: 2.034
using fread                     Count: 400000   Time: 0.257
ignore                          Count: 400000   Time: 0.607
using streambuf iterators       Count: 400000   Time: 0.608
using stream iterators          Count: 400000   Time: 5.136

Using the same input, but compiled with g++ 4.7.1:

ignore                          Count: 400000   Time: 0.359
using getc                      Count: 400000   Time: 0.339
using fread                     Count: 400000   Time: 0.243
ignore                          Count: 400000   Time: 0.697
using streambuf iterators       Count: 400000   Time: 0.694
using stream iterators          Count: 400000   Time: 1.612

So, even though all the reads are buffered, we're seeing a variation of about 8:1 with g++ and about 20:1 with VC++. Of course, I haven't tested (even close to) every possible way of reading the input. I doubt we'd see a lot wider range of times if we tested more techniques for reading, but I could be wrong about that. Whether we do or not, we're seeing enough variation that at least if you're processing a lot of data, you could well be justified in choosing one technique over another to improve processing speed.

like image 157
Jerry Coffin Avatar answered Oct 17 '22 10:10

Jerry Coffin