Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read a file backwards to find substring efficiently

I have a huge log file in this kind of structure:

"timestamp": {"identifier":value}

"1463403600":{"AA":74.42},
"1463403601":{"AA":29.55},
"1463403603":{"AA":24.78},
"1463403604":{"AA":8.46},
"1463403605":{"AA":44.84},
"1463403607":{"AA":87.05},
"1463403608":{"AA":54.81},
"1463403609":{"AA":93.1},
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},

I want to extract the content after(!) a given timestamp like:

std::ifstream * myfunc( uint32_t timestamp) 

example:

myfunc(1463403611);
/* returns
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},
*/

The logfile is long - too long to keep it in memory. The code will run on a resource limited embedded devices (80Mhz, ~10kB free memory), so Im looking for some ideas for a effective solution.

The logfile might have 500k+ entries and in 99% of the time the timestamp will be in the last 100 lines, so starting at the beginnig of the file and check every line for the right timestamp would be very inefficient.

So I guess Im looking for a solution to read the file backwards, line by line. I dont realy have a solution how to do that efficient without loading big chunks into memory.

I tried with reading in chunks of 200bytes starting from the EOF, but was faced with the issue, that the chunk cut the timestamp into half in many cases. I tried to detect that and reselect some bytes if needed, but got the feeling, that there must be a smarte solution.

like image 967
Res ulma Avatar asked May 16 '16 13:05

Res ulma


3 Answers

Well I found this kind of interesting so I knocked up a proof-of-concept for the binary-search idea.

This is poorly tested and probably a little buggy but seems to work so far and demonstrates the idea of divide-and-conquer. You check in the middle of the file and, depending if you are to high or too low, you divide the data into two and search the relevant half. You do that recursively until you get close enough.

#include <ctime>
#include <cmath>
#include <cstdlib>
#include <string>
#include <fstream>
#include <iostream>

// Don't use this, its just to show how many reads
// are being done to find the record.
int global_counter;

std::streampos find_stamp(std::istream& is, long stamp, std::streampos pos, std::streampos end)
{
    ++global_counter;

    if(pos == 0) // can't divide zero
        return 0;

    std::string s;
    long found_stamp;

    // extract nearest timestamp after pos
    is.seekg(pos);
    if(!(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp))
        return end;

    // if its too big check first half of this region
    if(found_stamp > stamp)
        return find_stamp(is, stamp, pos / 2, pos);

    // if its not within 10 timestamp seconds check end half of this region
    if(stamp - found_stamp > 10)
        return find_stamp(is, stamp, (pos + end) / 2, end);

    // read record by record (prolly more efficient than skipping)
    pos = is.tellg();
    while(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp)
    {
        if(found_stamp > stamp)
            return pos;
        pos = is.tellg();
    }
    return end;
}

void print_after(const std::string& filename, long stamp)
{
    // open at end of file (to get length)
    std::ifstream ifs(filename, std::ios::ate);

    std::streampos end = ifs.tellg();
    auto pos = end / 2; // start checking in middle

    // find position before required record
    // (may be in the middle of a record)
    if((pos = find_stamp(ifs, stamp, pos, end)) != end)
    {
        ifs.seekg(pos);

        std::string line;
        std::getline(ifs, line, ','); // skip to next whole record

        // print out all following recors
        while(std::getline(ifs, line, ','))
            std::cout << line;
    }
}

inline
std::string leading_zeros(int n, int zeros = 2)
{
    std::string s;
    for(int z = std::pow(10, zeros - 1); z; z /= 10)
        s += (n < z ? "0":"");
    return s + std::to_string(n);
}

int main()
{
    std::srand(std::time(0));

    // generate some test data
    std::ofstream ofs("test.txt");

    for(int i = 0; i < 1000; ++i)
    {
        ofs << '"' << leading_zeros(i, 10) << '"';
        ofs << ":{\"AA\":" << (std::rand() % 100);
        ofs << '.' << (std::rand() % 100) << "},\n";
    }

    ofs.close();

    global_counter = 0;
    print_after("test.txt", 993);

    std::cout << "find checked " << global_counter << " places in the file\n";
}

Output:

"0000000994":{"AA":80.6}
"0000000995":{"AA":11.90}
"0000000996":{"AA":16.43}
"0000000997":{"AA":53.11}
"0000000998":{"AA":68.43}
"0000000999":{"AA":79.77}
find checked 6 places in the file
like image 79
Galik Avatar answered Nov 16 '22 02:11

Galik


Since you are on an embedded device where mmap() is likely not available, I guess the only viable approach is to use a buffer that you fill with a part of the file, to be able to examine its contents one chunk at a time. Note that you will need to overlap your buffer windows to avoid missing a line that is cut in half by the beginning of the buffer. You will need to seek for the first newline at the beginning of a chunk and discard it with anything before it before you can start examining the chunk for the timestamps. Discarding the partial line at the beginning of the buffer also helps in aligning the end of that same line with the end of the buffer when you load the previous chunk into your buffer.

The handling of incomplete lines at the beginning of the buffer makes this a very ugly and error prone approach. This is the reason why I would suggest using mmap() if at all available, it would allow you to just ignore these issues.

like image 31
cmaster - reinstate monica Avatar answered Nov 16 '22 03:11

cmaster - reinstate monica


If performance is not an issue, you can read the entire file line by line till you reach the time requested, and then start dump. There is no reason to read the entire file in memory. If performance is an issue, seek at half the file, check the timestamp, then divide by two again in a binary search fashion.

like image 3
Felice Pollano Avatar answered Nov 16 '22 03:11

Felice Pollano