I have a huge log file in this kind of structure:
"timestamp": {"identifier":value}
"1463403600":{"AA":74.42},
"1463403601":{"AA":29.55},
"1463403603":{"AA":24.78},
"1463403604":{"AA":8.46},
"1463403605":{"AA":44.84},
"1463403607":{"AA":87.05},
"1463403608":{"AA":54.81},
"1463403609":{"AA":93.1},
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},
I want to extract the content after(!) a given timestamp like:
std::ifstream * myfunc( uint32_t timestamp)
example:
myfunc(1463403611);
/* returns
"1463403611":{"AA":77.64},
"1463403612":{"AA":33.39},
"1463403613":{"AA":69.2},
*/
The logfile is long - too long to keep it in memory. The code will run on a resource limited embedded devices (80Mhz, ~10kB free memory), so Im looking for some ideas for a effective solution.
The logfile might have 500k+ entries and in 99% of the time the timestamp will be in the last 100 lines, so starting at the beginnig of the file and check every line for the right timestamp would be very inefficient.
So I guess Im looking for a solution to read the file backwards, line by line. I dont realy have a solution how to do that efficient without loading big chunks into memory.
I tried with reading in chunks of 200bytes starting from the EOF, but was faced with the issue, that the chunk cut the timestamp into half in many cases. I tried to detect that and reselect some bytes if needed, but got the feeling, that there must be a smarte solution.
Well I found this kind of interesting so I knocked up a proof-of-concept for the binary-search idea.
This is poorly tested and probably a little buggy but seems to work so far and demonstrates the idea of divide-and-conquer. You check in the middle of the file and, depending if you are to high or too low, you divide the data into two and search the relevant half. You do that recursively until you get close enough.
#include <ctime>
#include <cmath>
#include <cstdlib>
#include <string>
#include <fstream>
#include <iostream>
// Don't use this, its just to show how many reads
// are being done to find the record.
int global_counter;
std::streampos find_stamp(std::istream& is, long stamp, std::streampos pos, std::streampos end)
{
++global_counter;
if(pos == 0) // can't divide zero
return 0;
std::string s;
long found_stamp;
// extract nearest timestamp after pos
is.seekg(pos);
if(!(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp))
return end;
// if its too big check first half of this region
if(found_stamp > stamp)
return find_stamp(is, stamp, pos / 2, pos);
// if its not within 10 timestamp seconds check end half of this region
if(stamp - found_stamp > 10)
return find_stamp(is, stamp, (pos + end) / 2, end);
// read record by record (prolly more efficient than skipping)
pos = is.tellg();
while(std::getline(std::getline(is, s, ','), s, '"') >> found_stamp)
{
if(found_stamp > stamp)
return pos;
pos = is.tellg();
}
return end;
}
void print_after(const std::string& filename, long stamp)
{
// open at end of file (to get length)
std::ifstream ifs(filename, std::ios::ate);
std::streampos end = ifs.tellg();
auto pos = end / 2; // start checking in middle
// find position before required record
// (may be in the middle of a record)
if((pos = find_stamp(ifs, stamp, pos, end)) != end)
{
ifs.seekg(pos);
std::string line;
std::getline(ifs, line, ','); // skip to next whole record
// print out all following recors
while(std::getline(ifs, line, ','))
std::cout << line;
}
}
inline
std::string leading_zeros(int n, int zeros = 2)
{
std::string s;
for(int z = std::pow(10, zeros - 1); z; z /= 10)
s += (n < z ? "0":"");
return s + std::to_string(n);
}
int main()
{
std::srand(std::time(0));
// generate some test data
std::ofstream ofs("test.txt");
for(int i = 0; i < 1000; ++i)
{
ofs << '"' << leading_zeros(i, 10) << '"';
ofs << ":{\"AA\":" << (std::rand() % 100);
ofs << '.' << (std::rand() % 100) << "},\n";
}
ofs.close();
global_counter = 0;
print_after("test.txt", 993);
std::cout << "find checked " << global_counter << " places in the file\n";
}
Output:
"0000000994":{"AA":80.6}
"0000000995":{"AA":11.90}
"0000000996":{"AA":16.43}
"0000000997":{"AA":53.11}
"0000000998":{"AA":68.43}
"0000000999":{"AA":79.77}
find checked 6 places in the file
Since you are on an embedded device where mmap()
is likely not available, I guess the only viable approach is to use a buffer that you fill with a part of the file, to be able to examine its contents one chunk at a time. Note that you will need to overlap your buffer windows to avoid missing a line that is cut in half by the beginning of the buffer. You will need to seek for the first newline at the beginning of a chunk and discard it with anything before it before you can start examining the chunk for the timestamps. Discarding the partial line at the beginning of the buffer also helps in aligning the end of that same line with the end of the buffer when you load the previous chunk into your buffer.
The handling of incomplete lines at the beginning of the buffer makes this a very ugly and error prone approach. This is the reason why I would suggest using mmap()
if at all available, it would allow you to just ignore these issues.
If performance is not an issue, you can read the entire file line by line till you reach the time requested, and then start dump. There is no reason to read the entire file in memory. If performance is an issue, seek at half the file, check the timestamp, then divide by two again in a binary search fashion.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With