 

getline while reading a file vs reading whole file and then splitting based on newline character

Tags:

c++

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on newline characters (using boost), or is it better to use getline()? My question is: does getline() read a single line each time it is called (resulting in multiple hard-disk accesses), or does it read the whole file and hand it out line by line?

asked Jan 22 '13 by psyche


People also ask

Does getline() include the newline character?

If successful, getline() returns the number of characters read, including the newline character but not including the terminating null byte ('\0'). This value can be used to handle embedded null bytes in the line read.
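
For illustration, a minimal sketch using the POSIX getline() (the file name input.txt is a placeholder):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r");   // hypothetical input file
    if (!fp)
        return 1;

    char *line = NULL;
    size_t cap = 0;
    ssize_t n;

    // getline() returns the number of characters read, including
    // the trailing '\n' when present, excluding the '\0'
    while ((n = getline(&line, &cap, fp)) != -1)
        printf("read %zd characters\n", n);

    free(line);
    fclose(fp);
    return 0;
}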

How do I ignore a newline in getline()?

The ignore() function does the trick. Called as cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n'), it discards the input sequence up to and including the next newline character; both the delimiter and the character limit can be specified.
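
A minimal sketch of the usual pattern:

#include <iostream>
#include <limits>
#include <string>

int main()
{
    int n;
    std::string rest;

    std::cin >> n;   // stops before the '\n', which stays in the buffer
    // discard everything up to and including the next newline
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    std::getline(std::cin, rest);   // now reads the next full line

    std::cout << n << " | " << rest << std::endl;
    return 0;
}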

Which is used to split the line read from a file stream?

In C++, read the file line by line with getline(), then split each line with an istringstream, calling getline() again with the delimiter as its third argument.
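
A minimal sketch, assuming a comma-delimited file named data.csv:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream f("data.csv");   // hypothetical comma-separated file
    std::string line;

    while (std::getline(f, line))              // read one full line
    {
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))   // split on the delimiter
            std::cout << field << '\n';
    }
    return 0;
}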

What does the getline() function do in C++?

The C++ getline() is an in-built function defined in the <string> header that reads a single or multi-word line of text from an input stream into a std::string. The cin object with >> also accepts input from the user, but stops at the first whitespace, so it cannot take multi-word or multi-line input.
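
A minimal sketch of the difference:

#include <iostream>
#include <string>

int main()
{
    std::string sentence;

    // std::cin >> sentence would stop at the first space;
    // getline() reads the entire line, spaces included
    std::getline(std::cin, sentence);
    std::cout << "you typed: " << sentence << std::endl;
    return 0;
}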


3 Answers

I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
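
A minimal sketch of that idiom (the helper name read_lines is mine, not from the answer):

#include <fstream>
#include <string>
#include <vector>

// Read every line of a file into a vector<string>. The stream's
// internal buffering means this does not touch the disk once per line.
std::vector<std::string> read_lines(const char *path)
{
    std::ifstream f(path);
    std::vector<std::string> lines;
    std::string line;

    while (std::getline(f, line))
        lines.push_back(line);

    return lines;
}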

However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found, as sketched below.
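
A sketch of that chunked approach (the 64KB chunk size, file name, and carry-over handling are all arbitrary choices):

#include <fstream>
#include <string>

int main()
{
    std::ifstream f("bigfile.txt", std::ios::binary);   // hypothetical file
    std::string carry;       // partial line carried over from the last chunk
    char chunk[1 << 16];     // 64KB at a time; the size is an arbitrary choice

    while (f.read(chunk, sizeof chunk) || f.gcount() > 0)
    {
        carry.append(chunk, f.gcount());

        std::string::size_type pos;
        while ((pos = carry.find('\n')) != std::string::npos)
        {
            std::string line = carry.substr(0, pos);
            carry.erase(0, pos + 1);
            // ... process line here ...
        }
    }

    if (!carry.empty())
    {
        // the last line had no trailing newline; process carry here
    }
    return 0;
}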

If you want to know specifically which method is faster and by how much, you'll have to profile your code.

answered by Mark B


getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
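
If fewer read() calls are wanted, one option is to hand the stream a larger buffer before opening the file. A sketch (the 1MB size is an arbitrary choice, and setbuf behaviour on a filebuf is ultimately implementation-defined):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);   // 1MB buffer; size is an arbitrary choice

    std::ifstream f;
    // the buffer must be handed over before the file is opened
    f.rdbuf()->pubsetbuf(buf.data(), buf.size());
    f.open("bigfile.txt");            // hypothetical file

    // ... read as usual; fewer, larger read() calls happen underneath ...
    return 0;
}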

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just scanning the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

But the "load the whole file in one go" has several, distinct, problems:

  1. You don't start processing until you have read the whole file.
  2. You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?

Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.

Edit: So, I wrote a program to measure this, since I think it's quite interesting.

And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all source files in a directory with about a dozen different source files, then copying the result over itself several times to "multiply" it), until each test took over 1.5 seconds to run, which is how long I think you need to run things to make sure the timing isn't too susceptible to random outside influences, such as a "network packet came in", taking time away from the process.

I also decided to measure the system and user time used by the process.

$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three different functions that read the file (there is some code to measure time and such as well, of course, but to keep this post shorter I chose not to include all of it; I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here).

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

using namespace std;

void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();

    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    // hand the raw buffer to a stringstream (pubsetbuf on a stringbuf is
    // implementation-defined, but it works with common implementations)
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();


    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);

    if (f == -1)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    off_t size = lseek(f, 0, SEEK_END);

    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    if (buffer == MAP_FAILED)
    {
        cerr << "mmap failed for " << fullname << endl;
        exit(1);
    }


    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    close(f);
    cout << "Lines=" << lines << endl;
}
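
The timing code mentioned above is not included in the answer; purely as a sketch, a POSIX harness for these measurements might look something like this (the function timed and its structure are my assumptions, not the author's actual code):

#include <sys/resource.h>
#include <sys/time.h>
#include <iostream>

static double to_sec(const timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// run one of the reader functions and report wallclock, user and system time
void timed(void (*func)(const char *), const char *name)
{
    timeval t0, t1;
    rusage r0, r1;

    getrusage(RUSAGE_SELF, &r0);
    gettimeofday(&t0, NULL);
    func(name);
    gettimeofday(&t1, NULL);
    getrusage(RUSAGE_SELF, &r1);

    std::cout << "Wallclock time for " << name << " is "
              << to_sec(t1) - to_sec(t0)
              << " (user:" << to_sec(r1.ru_utime) - to_sec(r0.ru_utime)
              << " system: " << to_sec(r1.ru_stime) - to_sec(r0.ru_stime)
              << ")" << std::endl;
}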
answered by Mats Petersson


The OS will read a whole block of data (typically 4-8k at a time, depending on how the disk is formatted) and do some of the buffering for you. Let the OS take care of that, and read the data in the way that makes sense for your program.

answered by Floris