Efficiently reading a very large text file in C++

My Code in C++:

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to C++. When I run this code, I face these problems:

1. Because the file is read in fixed-size blocks of bytes, the last line of a block is sometimes an unfinished line of the original file ("4624996948753406865 10214" instead of the actual string "4624996948753406865 10214715013130414417" in the main file).
2. The code runs very slowly. It takes around 6 seconds to process one block on a 64-bit Intel Core i7 920 system with 6 GB of RAM. Are there any optimization techniques I can use to improve the runtime?
3. Is it necessary to include "\n" along with the blank character in the boost split function?

I have read about mmap files in C++ but I am not sure whether it's the correct way to do so. If yes, please attach some links.

asked Nov 04 '14 13:11

Pattu

1 Answer

I'd redesign this to work as a stream instead of on fixed-size blocks.
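As a rough illustration of what a streaming design could look like, here is a minimal sketch that reads whole lines with std::getline, so a line can never be cut in half at a block boundary, and splits each line on whitespace. The name process_lines and the handle_tokens callback are hypothetical; the callback stands in for the question's do_some_operation.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read the stream line by line, split every line on
// whitespace, and pass the tokens of that line to the caller's handler.
// Returns the number of lines read.
template <typename Handler>
std::size_t process_lines(std::istream& in, Handler handle_tokens) {
    std::size_t line_count = 0;
    std::string line;
    std::vector<std::string> tokens;
    while (std::getline(in, line)) {        // the loop ends on EOF or error
        tokens.clear();
        std::istringstream fields(line);
        for (std::string tok; fields >> tok; ) {
            tokens.push_back(tok);          // operator>> skips blanks for us
        }
        handle_tokens(tokens);
        ++line_count;
    }
    return line_count;
}
```

Because std::getline already consumes the newline, the splitter only has to deal with blanks inside a line. Testing the read itself in the loop condition also avoids the pitfalls of looping on !fin.eof().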

A simpler approach would be:

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values are expected, using std::vector::reserve up front could speed it up further.

Alternatively you can use a memory mapped file and iterate over the character sequence.

How to parse space-separated floats in C++ quickly? shows these approaches with benchmarks for floats.

Update: I modified the above program to parse uint32_t values into a vector.

When using a sample input file of 4.5GiB^[1] the program runs in 9 seconds^[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size():   402653184
0:08.96 elapsed, 6 context switches

Of course, it allocates at least 402653184 * 4 bytes = 1.5 GiB. So when you read a 45 GiB file, you will need an estimated 15 GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45 GiB parse completes in 10min 45s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size():   4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though). wc is blazingly fast.
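The memory estimates for the vector can be double-checked with a few lines of arithmetic. This is only a sketch; gib_for is a hypothetical helper, and the counts are the data.size() values reported in the two runs above.

```cpp
#include <cstdint>

// Hypothetical helper: storage needed for `count` values of
// `bytes_per_value` bytes each, expressed in GiB (2^30 bytes).
constexpr double gib_for(std::uint64_t count, std::uint64_t bytes_per_value) {
    return static_cast<double>(count * bytes_per_value) /
           (1024.0 * 1024.0 * 1024.0);
}

// 402653184 values * 4 bytes  -> 1.5 GiB (the 4.5 GiB input file)
// 4026531840 values * 4 bytes -> 15 GiB  (the 45 GiB input file)
static_assert(gib_for(402653184ULL, 4) == 1.5, "small benchmark estimate");
static_assert(gib_for(4026531840ULL, 4) == 15.0, "large benchmark estimate");
```

These figures cover only the elements themselves; any spare capacity the vector holds comes on top.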

Full Code Used For Benchmark

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc<2) return 255;
    data_t data;
    data.reserve(4392580288);   // for the  45 GiB file benchmark
    // data.reserve(402653284); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)   
        std::cout << "Parse success\n";
    else 
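        // show at most 50 characters of context, without reading past the end of the mapping
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f) << " around '" << std::string(f, (l - f > 50) ? f + 50 : l) << "'\n";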

    if (f!=l) 
        std::cerr << "Trailing unparsed: " << std::distance(f,l) << " characters\n";

    std::cout << "Data.size():   " << data.size() << "\n";
    std::cout << "Time taken by parsing: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time-start_parse).count() / 1000.0 << "s\n";
}

^[1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

^[2] obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit

answered Oct 21 '22 23:10

sehe

Tags: c++, linux, boost, mmap, external-sorting