Most efficient way to parse every fourth line from a very large file

I have a file of the following format:

1: some_basic_info_in_this_line
2: LOTS_OF_INFO_IN_THIS_LINE_HUNDREDS_OF_CHARS
3: some_basic_info_in_this_line
4: LOTS_OF_INFO_IN_THIS_LINE_HUNDREDS_OF_CHARS
...

That format repeats itself tens of thousands of times, making files up to 50 GiB+. I need an efficient way to process only line 2 of this format. I'm open to using C, C++11 STL, or Boost. I've looked at various other questions about file streaming on SO, but I feel my situation is unique because of the large file size and because I only need one out of every four lines.

Memory mapping the file seems to be the most efficient approach from what I've read, but mapping a 50+ GB file will eat up most computers' RAM (you can assume that this application will be used by "average" users, say 4-8 GiB RAM). Also, I only need to process one line at a time. Here is how I am currently doing this (yes, I'm aware this is not efficient; that's why I'm redesigning it):

std::string GL::getRead(ifstream& input)
{
    std::string str;
    std::string toss;
    if (input.good())
    {
        getline(input, toss);
        getline(input, str);
        getline(input, toss);
        getline(input, toss);
    }
    return str;
}

Is breaking the mmap into blocks the answer for my situation? Is there any way I can take advantage of only needing 1 out of 4 lines? Thanks for the help.

zeus_masta_funk asked Oct 17 '15 18:10



1 Answer

Use ignore instead of getline for the lines you discard, so they are never copied into a string:

std::string GL::getRead(ifstream& input)
{
    std::string str;
    if (!input.fail())
    {
        input.ignore(LARGE_NUMBER, '\n');
        getline(input, str);
        input.ignore(LARGE_NUMBER, '\n');
        input.ignore(LARGE_NUMBER, '\n');
    }
    return str;
}

LARGE_NUMBER could be std::numeric_limits<std::streamsize>::max() if you don't have a good reason to use a smaller number (think of DoS attacks).

TIP Consider passing str by reference. By reading into the same string each time, you can avoid a lot of allocations, which are typically the number 1 reason your program runs slow.

TIP Consider using a memory-mapped file (Boost Iostreams, Boost Interprocess, or mmap(2))

sehe answered Oct 12 '22 23:10
