I run performance tests against a few Java applications. The applications produce very large log files (7-10 GB) during a test, and I need to trim these log files between specific dates and times. Currently I use a Python script that parses the log timestamps into datetime objects and prints only the matching lines, but this solution is very slow: a 5 GB log takes about 25 minutes to parse. Since the entries in the log file are sequential, I don't need to read the whole file line by line. I thought about reading the file from the start and from the end until the condition is matched, and then printing the lines between the matched line numbers, but I don't know how to read a file backwards without loading it all into memory.
Can you suggest any suitable solution for this case?
Here is part of the Python script:
from datetime import datetime

lfmt = '%Y-%m-%d %H:%M:%S'
file = open(filename, 'rU')
normal_line = ''
for line in file:
    if line[0] == '[':
        ltimestamp = datetime.strptime(line[1:20], lfmt)
        if ltimestamp >= str and ltimestamp <= end:  # str and end are the start/end datetimes, set elsewhere
            normal_line = 'True'
        else:
            normal_line = ''
    if normal_line:
        print line,
As the data is sequential, if the start and end of the region of interest are near the beginning of the file, then reading from the end of the file (to find the matching end point) is still a bad solution!
I've written some code that will quickly find the start and end points as you require; this approach is called binary search and is similar to the classic children's "higher or lower" guessing game!
The script reads a trial line midway between lower_bound and upper_bound (initially the start and end of the file) and checks the match criteria. If the sought line is earlier, it guesses again by reading a line halfway between lower_bound and the previous trial (if it is later, it splits between its guess and the upper bound). So you keep iterating between upper and lower bounds; this yields the fastest possible "on average" solution.
This should be a really quick solution (log to the base 2 of the number of lines!). For example, in the worst possible case (finding line 999 out of 1000 lines), binary search would take just 9 line reads! (From a billion lines it would take just 30...)
Assumptions for the below code: lines of interest start with '[' followed by a timestamp in the '%Y-%m-%d %H:%M:%S' format, as in the question.
Further: the code seeks by byte position with file.seek() (which is superior to counting lines); thanks to TerryE and J.F. Sebastian for pointing that out.

import datetime
def match(line):
    lfmt = '%Y-%m-%d %H:%M:%S'
    if line[0] == '[':
        return datetime.datetime.strptime(line[1:20], lfmt)

def retrieve_test_line(position):
    file.seek(position, 0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position, 0)
    new_position = file.tell()  # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = int((lower_bound + upper_bound) / 2)
    inspection_text, position = retrieve_test_line(trial)
    if position == upper_bound:
        text, position = check_lower_bound(lower_bound)
        if match(text) == target:
            return position
        return  # no match for target within range
    matched_position = match(inspection_text)
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    elif matched_position > target:
        return find_line(target, lower_bound, position)
    else:
        return  # no match for target within range

lfmt = '%Y-%m-%d %H:%M:%S'
# start_target = first line you are trying to find:
start_target = datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# end_target = last line you are trying to find:
end_target = datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)

file = open("log_file.txt", "r")
lower_bound = 0
file.seek(0, 2)  # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)

if sequence_start or sequence_start == 0:  # allow for starting at zero - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if not sequence_end:
        print "start_target match: ", sequence_start
        print "end match is not present in the current file"
else:
    print "start match is not present in the current file"

if (sequence_start or sequence_start == 0) and sequence_end:
    print "start_target match: ", sequence_start
    print "end_target match: ", sequence_end
    print
    print start_target, 'target'
    file.seek(sequence_start, 0)
    print file.readline()
    print end_target, 'target'
    file.seek(sequence_end, 0)
    print file.readline()
"5 GB log is parsed about 25 minutes"
That is ~3 MB/s. Even a sequential O(n) scan in Python can do much better (~500 MB/s for wc-l.py), i.e., the performance should be limited only by I/O.
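As a rough illustration of what such a sequential pass can look like (a sketch, not the wc-l.py script itself): because the '%Y-%m-%d %H:%M:%S' format sorts lexicographically, the lines can be filtered by plain string comparison, with no strptime call per line.

import sys

start = '2012-02-01 13:10:00'  # boundaries from the question, as plain strings
end = '2012-02-01 13:39:00'

with open('log_file.txt') as f:
    for line in f:
        # ISO-like timestamps compare correctly as strings, so no parsing is needed
        if line.startswith('[') and start <= line[1:20] <= end:
            sys.stdout.write(line)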
To perform a binary search on a file you could adapt FileSearcher, which uses fixed-size records, to use lines instead, using an approach similar to the tail -n implementation in Python (it is O(n) to scan for '\n').
To avoid O(n) behaviour (if the date range selects only a small portion of the log) you could use an approximate search that uses large fixed chunks and allows some records to be missed because they lie on a chunk boundary, e.g., use the unmodified FileSearcher with record_size=1MB and a custom Query class:
class Query(object):

    def __init__(self, query):
        self.query = query  # e.g., '2012-01-01'

    def __lt__(self, chunk):
        # assume line starts with a date; find the start of line
        i = chunk.find('\n')
        # assert '\n' in chunk and len(chunk) > (len(self.query) + i)
        # e.g., '2012-01-01' < '2012-03-01'
        return self.query < chunk[i+1:i+1+len(self.query)]
To take into account that the date range can span multiple chunks you could modify FileSearcher.__getitem__ to return (filepos, chunk) and search twice (bisect_left(), bisect_right()) to find the approximate filepos_mindate and filepos_maxdate. After that you could perform a linear search (e.g., using the tail -n approach) around the given file positions to find the exact first and last log records.
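Here is a minimal sketch of that chunked bisection. It does not use the actual FileSearcher/Query pair: instead, a hypothetical ChunkView class returns the timestamp prefix of the first complete line in each 1 MB chunk, so the standard bisect module can compare plain strings; the resulting chunk indices are then widened by one chunk on each side to allow for records straddling chunk boundaries, before the exact linear scan.

import bisect

CHUNK = 1024 * 1024  # 1 MB fixed records, as suggested above

class ChunkView(object):
    """Read-only sequence view of a file: item i is the timestamp prefix of
    the first complete line in the i-th fixed-size chunk (approximate)."""

    def __init__(self, f, chunk_size=CHUNK, ts_len=19):  # '2012-02-01 13:10:00' -> 19 chars
        self.f = f
        self.chunk_size = chunk_size
        self.ts_len = ts_len
        f.seek(0, 2)
        self.nchunks = (f.tell() + chunk_size - 1) // chunk_size

    def __len__(self):
        return self.nchunks

    def __getitem__(self, i):
        self.f.seek(i * self.chunk_size)
        chunk = self.f.read(self.chunk_size)
        j = chunk.find('\n') + 1                     # position just after the first newline (a line start)
        return chunk[j + 1:j + 1 + self.ts_len]      # skip the leading '[' of that line

with open('log_file.txt') as f:
    view = ChunkView(f)
    lo = bisect.bisect_left(view, '2012-02-01 13:10:00')   # first chunk that may hold the start
    hi = bisect.bisect_right(view, '2012-02-01 13:39:00')  # first chunk past the end
    # widen by one chunk, then scan linearly within [filepos_mindate, filepos_maxdate)
    filepos_mindate = max(lo - 1, 0) * CHUNK
    filepos_maxdate = min(hi + 1, len(view)) * CHUNK
    print(filepos_mindate, filepos_maxdate)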
7 to 10 GB is a large amount of data. If I had to analyse that kind of data, I would either make the applications log to a database or upload the log files to a database. Then there is a lot of analysis you can do efficiently on a database. If you are using a standard logging tool like Log4J, logging to a database should be quite simple. Just suggesting an alternative solution.
For more on database logging you can refer to this post:
A good database log appender for Java?
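For the "upload the log files to a database" route, a minimal sketch using SQLite from the Python standard library (the database file, table and column names here are made up for illustration): because the timestamps are stored as ISO-style strings, an index on them makes date-range queries cheap.

import sqlite3

def load_log(db_path, log_path):
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS log (ts TEXT, line TEXT)')
    with open(log_path) as f:
        rows = ((line[1:20], line.rstrip('\n'))
                for line in f if line.startswith('['))
        conn.executemany('INSERT INTO log VALUES (?, ?)', rows)
    conn.execute('CREATE INDEX IF NOT EXISTS idx_ts ON log (ts)')
    conn.commit()
    return conn

conn = load_log('logs.db', 'log_file.txt')
for ts, line in conn.execute(
        'SELECT ts, line FROM log WHERE ts BETWEEN ? AND ?',
        ('2012-02-01 13:10:00', '2012-02-01 13:39:00')):
    print(line)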