 

handling many huge log files with python

Tags: python, logging

I'm using some Python scripts to do statistics. One kind of log, which I'll call the A logs, has records in the format:

[2012-09-12 12:23:33] SOME_UNIQ_ID filesize

Another kind, which I'll call the B logs, has the format:

[2012-09-12 12:24:00] SOME_UNIQ_ID

I need to count how many records in the A logs also appear in the B logs, and for each matching ID get the time gap between the two records. My implementation was to load the time and ID of every B-log record into a map, then iterate over the A logs checking whether each ID exists in the map. The problem is that this uses too much memory, since I have almost 100 million records in the B logs. Any suggestions to improve the performance and memory usage? Thanks.
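
Roughly, my implementation looks like the following sketch (file names here are placeholders); it's the dict holding ~100 million B-log entries that exhausts memory:

b_times = {}
with open('b.log') as b:  # placeholder file name
    for line in b:
        # split from the right so the space inside the bracketed timestamp survives
        timestamp, uniq_id = line.rstrip('\n').rsplit(' ', 1)
        b_times[uniq_id] = timestamp  # ~100M entries held in RAM at once

with open('a.log') as a:  # placeholder file name
    for line in a:
        timestamp, uniq_id, filesize = line.rstrip('\n').rsplit(' ', 2)
        if uniq_id in b_times:
            pass  # match found: compute the time gap here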

asked Sep 15 '12 by rpbear

2 Answers

You could try reversing the lookup, depending on whether "A" fits into memory, and sequentially scan "B".
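
A minimal sketch of that reversal, assuming the A logs fit in memory (file names are placeholders):

from datetime import datetime

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

# Load the (smaller) A logs into a dict keyed by ID...
a_times = {}
with open('a.log') as a:  # placeholder file name
    for line in a:
        timestamp, uniq_id, filesize = line.rstrip('\n').rsplit(' ', 2)
        a_times[uniq_id] = datetime.strptime(timestamp, TIME_FORMAT)

# ...then stream the huge B logs one line at a time, never holding them in memory.
with open('b.log') as b:  # placeholder file name
    for line in b:
        timestamp, uniq_id = line.rstrip('\n').rsplit(' ', 1)
        if uniq_id in a_times:
            gap = abs(datetime.strptime(timestamp, TIME_FORMAT) - a_times[uniq_id])
            print(uniq_id, 'has a difference of', gap)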

Otherwise, load the log files into an SQLite3 database with two tables (log_a, log_b) containing (timestamp, uniq_id, rest_of_line), then execute an SQL join on uniq_id and do whatever processing you require on the results. This keeps the memory overhead low and lets the SQL engine do the join, but of course it does effectively duplicate the log files on disk (which is generally not an issue on most systems).

Example:

import sqlite3
from datetime import datetime

db = sqlite3.connect(':memory:')

# Load the A logs: timestamp, unique ID, and filesize per line.
db.execute('create table log_a (timestamp, uniq_id, filesize)')
a = ['[2012-09-12 12:23:33] SOME_UNIQ_ID filesize']
for line in a:
    # rsplit from the right, so the space inside the bracketed timestamp stays intact
    timestamp, uniq_id, filesize = line.rsplit(' ', 2)
    db.execute('insert into log_a values(?, ?, ?)', (timestamp, uniq_id, filesize))
db.commit()

# Load the B logs: timestamp and unique ID per line.
db.execute('create table log_b (timestamp, uniq_id)')
b = ['[2012-09-12 13:23:33] SOME_UNIQ_ID']
for line in b:
    timestamp, uniq_id = line.rsplit(' ', 1)
    db.execute('insert into log_b values(?, ?)', (timestamp, uniq_id))
db.commit()

# With USING, the shared column appears once, so each result row is
# (a.timestamp, uniq_id, filesize, b.timestamp).
TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'
for matches in db.execute('select * from log_a join log_b using (uniq_id)'):
    log_a_ts = datetime.strptime(matches[0], TIME_FORMAT)
    log_b_ts = datetime.strptime(matches[3], TIME_FORMAT)
    print(matches[1], 'has a difference of', abs(log_a_ts - log_b_ts))
    # 'SOME_UNIQ_ID has a difference of 1:00:00'
    # '1:00:00' == datetime.timedelta(0, 3600)

Note that:

  • for real use, pass a filename to sqlite3.connect instead of ':memory:' so the database lives on disk
  • a and b should be your actual log files, iterated line by line (strip the trailing newline before splitting)
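
Putting those two notes together, the loading step might look something like this sketch (on-disk database, placeholder file names):

import sqlite3

db = sqlite3.connect('logs.db')  # a filename gives an on-disk database
db.execute('create table if not exists log_a (timestamp, uniq_id, filesize)')

with open('a.log') as a:  # the real log file, read line by line
    for line in a:
        timestamp, uniq_id, filesize = line.rstrip('\n').rsplit(' ', 2)
        db.execute('insert into log_a values(?, ?, ?)',
                   (timestamp, uniq_id, filesize))
db.commit()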
answered Nov 11 '22 by Jon Clements


Try this:

  • Externally sort both files by SOME_UNIQ_ID
  • Read one record from the A logs and save its SOME_UNIQ_ID (A)
  • Read one record from the B logs and save its SOME_UNIQ_ID (B)
  • Compare SOME_UNIQ_ID (B) with SOME_UNIQ_ID (A)
    • If it is less, read the next record from the B logs
    • If it is greater, read the next record from the A logs and compare with the saved SOME_UNIQ_ID (B)
    • If they are equal, compute the time gap

Assuming the external sort works efficiently, you end up reading each file just once; a sketch follows below.
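
A rough sketch of that merge, assuming both files have already been externally sorted by SOME_UNIQ_ID (e.g. with the Unix sort utility on the third whitespace-separated field: sort -k3,3) and that each ID appears at most once per file; file names are placeholders:

from datetime import datetime

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

def parse(line, maxsplit):
    # rsplit keeps the space inside the bracketed timestamp intact
    fields = line.rstrip('\n').rsplit(' ', maxsplit)
    return fields[1], datetime.strptime(fields[0], TIME_FORMAT)

with open('a.sorted') as a, open('b.sorted') as b:  # placeholder file names
    line_a, line_b = a.readline(), b.readline()
    while line_a and line_b:
        id_a, ts_a = parse(line_a, 2)  # A lines: timestamp, id, filesize
        id_b, ts_b = parse(line_b, 1)  # B lines: timestamp, id
        if id_a < id_b:
            line_a = a.readline()      # A is behind: advance A
        elif id_b < id_a:
            line_b = b.readline()      # B is behind: advance B
        else:
            print(id_a, 'has a difference of', abs(ts_a - ts_b))
            line_a, line_b = a.readline(), b.readline()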

answered Nov 11 '22 by Geordee Naliyath