 

getline while reading a file vs reading whole file and then splitting based on newline character

Tags:

c++

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on newline characters (using boost), or is it better to use getline()? My question is: does getline() read a single line each time it is called (resulting in multiple hard-disk accesses), or does it read the whole file and hand it out line by line?

asked Jan 22 '13 by psyche


People also ask

Does getline() include the newline character?

If successful, getline() returns the number of characters read, including the newline character but not including the terminating null byte ('\0'). This value can be used to handle embedded null bytes in the line read.
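
For illustration, a minimal sketch using the POSIX getline() (the file name input.txt is a placeholder):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp = fopen("input.txt", "r");   // hypothetical input file
    if (!fp)
        return 1;

    char *line = NULL;
    size_t cap = 0;
    ssize_t n;

    // getline() returns the number of characters read, including
    // the trailing '\n' when present, excluding the '\0'
    while ((n = getline(&line, &cap, fp)) != -1)
        printf("read %zd characters\n", n);

    free(line);
    fclose(fp);
    return 0;
}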

How do I ignore a newline in getline()?

The ignore() function does the trick. Called as cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n'), it discards the input sequence up to and including the next newline character; both the delimiter and the character limit can be specified.
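
A minimal sketch of the usual pattern:

#include <iostream>
#include <limits>
#include <string>

int main()
{
    int n;
    std::string rest;

    std::cin >> n;   // stops before the '\n', which stays in the buffer
    // discard everything up to and including the next newline
    std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    std::getline(std::cin, rest);   // now reads the next full line

    std::cout << n << " | " << rest << std::endl;
    return 0;
}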

Which is used to split the line read from a file stream?

In C++, read the file line by line with getline(), then split each line with an istringstream, calling getline() again with the delimiter as its third argument.
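
A minimal sketch, assuming a comma-delimited file named data.csv:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream f("data.csv");   // hypothetical comma-separated file
    std::string line;

    while (std::getline(f, line))              // read one full line
    {
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))   // split on the delimiter
            std::cout << field << '\n';
    }
    return 0;
}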

What does the getline() function do in C++?

The C++ getline() is an in-built function defined in the <string> header that reads a single or multi-word line of text from an input stream into a std::string. The cin object with >> also accepts input from the user, but stops at the first whitespace, so it cannot take multi-word or multi-line input.
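
A minimal sketch of the difference:

#include <iostream>
#include <string>

int main()
{
    std::string sentence;

    // std::cin >> sentence would stop at the first space;
    // getline() reads the entire line, spaces included
    std::getline(std::cin, sentence);
    std::cout << "you typed: " << sentence << std::endl;
    return 0;
}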


3 Answers

I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
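
A minimal sketch of that idiom (the helper name read_lines is mine, not from the answer):

#include <fstream>
#include <string>
#include <vector>

// Read every line of a file into a vector<string>. The stream's
// internal buffering means this does not touch the disk once per line.
std::vector<std::string> read_lines(const char *path)
{
    std::ifstream f(path);
    std::vector<std::string> lines;
    std::string line;

    while (std::getline(f, line))
        lines.push_back(line);

    return lines;
}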

However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found, as sketched below.
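
A sketch of that chunked approach (the 64KB chunk size, file name, and carry-over handling are all arbitrary choices):

#include <fstream>
#include <string>

int main()
{
    std::ifstream f("bigfile.txt", std::ios::binary);   // hypothetical file
    std::string carry;       // partial line carried over from the last chunk
    char chunk[1 << 16];     // 64KB at a time; the size is an arbitrary choice

    while (f.read(chunk, sizeof chunk) || f.gcount() > 0)
    {
        carry.append(chunk, f.gcount());

        std::string::size_type pos;
        while ((pos = carry.find('\n')) != std::string::npos)
        {
            std::string line = carry.substr(0, pos);
            carry.erase(0, pos + 1);
            // ... process line here ...
        }
    }

    if (!carry.empty())
    {
        // the last line had no trailing newline; process carry here
    }
    return 0;
}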

If you want to know specifically which method is faster and by how much, you'll have to profile your code.

answered by Mark B


getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
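
If fewer read() calls are wanted, one option is to hand the stream a larger buffer before opening the file. A sketch (the 1MB size is an arbitrary choice, and setbuf behaviour on a filebuf is ultimately implementation-defined):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);   // 1MB buffer; size is an arbitrary choice

    std::ifstream f;
    // the buffer must be handed over before the file is opened
    f.rdbuf()->pubsetbuf(buf.data(), buf.size());
    f.open("bigfile.txt");            // hypothetical file

    // ... read as usual; fewer, larger read() calls happen underneath ...
    return 0;
}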

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just scanning the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

But the "load the whole file in one go" has several, distinct, problems:

  1. You don't start processing until you have read the whole file.
  2. You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?

Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.

Edit: So, I wrote a program to measure this, since I think it's quite interesting.

And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each (by copying all source files in a directory with about a dozen different source files, then copying the result over itself several times to "multiply" it), until each test took over 1.5 seconds to run, which is how long I think you need to run things to make sure the timing isn't too susceptible to random outside influences, such as a "network packet came in", taking time away from the process.

I also decided to measure the system and user time used by the process.

$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three different functions that read the file (there is some code to measure time and such as well, of course, but to keep this post shorter I chose not to include all of it; I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here).

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

using namespace std;

void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();

    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    // hand the raw buffer to a stringstream (pubsetbuf on a stringbuf is
    // implementation-defined, but it works with common implementations)
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();


    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);

    if (f == -1)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    off_t size = lseek(f, 0, SEEK_END);

    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    if (buffer == MAP_FAILED)
    {
        cerr << "mmap failed for " << fullname << endl;
        exit(1);
    }


    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    close(f);
    cout << "Lines=" << lines << endl;
}
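
The timing code mentioned above is not included in the answer; purely as a sketch, a POSIX harness for these measurements might look something like this (the function timed and its structure are my assumptions, not the author's actual code):

#include <sys/resource.h>
#include <sys/time.h>
#include <iostream>

static double to_sec(const timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// run one of the reader functions and report wallclock, user and system time
void timed(void (*func)(const char *), const char *name)
{
    timeval t0, t1;
    rusage r0, r1;

    getrusage(RUSAGE_SELF, &r0);
    gettimeofday(&t0, NULL);
    func(name);
    gettimeofday(&t1, NULL);
    getrusage(RUSAGE_SELF, &r1);

    std::cout << "Wallclock time for " << name << " is "
              << to_sec(t1) - to_sec(t0)
              << " (user:" << to_sec(r1.ru_utime) - to_sec(r0.ru_utime)
              << " system: " << to_sec(r1.ru_stime) - to_sec(r0.ru_stime)
              << ")" << std::endl;
}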
answered by Mats Petersson


The OS will read a whole block of data (typically 4-8k at a time, depending on how the disk is formatted) and do some of the buffering for you. Let the OS take care of that, and read the data in the way that makes sense for your program.

answered by Floris