Optimize this Python log-parsing code

The runtime of this code on my laptop for a 4.2 GB input file is 48 seconds. The input file is tab-delimited, with each value appearing in quotes. Each record ends with a newline, e.g. '"val1"\t"val2"\t"val3"\t..."valn"\n'

I've tried using multiprocessing with 10 processes: one to queue up the lines, eight to parse individual lines and fill an output queue, and one to reduce the output queue into the defaultdict shown below. But that code took 300 seconds to run, over six times longer than the following:
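
Roughly, that queue-based version was laid out like this (a simplified reconstruction, not my exact code; the reduce step runs in the parent process here, and the function names are made up). The point is that every single line crosses two queues, i.e. two pickle/unpickle round trips:

from multiprocessing import Process, Queue
from collections import defaultdict

def line_parser(in_q, out_q, ix_profile, ix_user):
    # worker: pull raw lines, push extracted user ids, stop on the None sentinel
    for line in iter(in_q.get, None):
        fields = line.split('\t')
        if fields[ix_profile] != '"7"':
            out_q.put(fields[ix_user][1:-1])
    out_q.put(None)  # tell the reducer this worker is finished

def producer(log, in_q, n_workers):
    # producer: one queue put per line of the multi-GB file
    f = open(log)
    f.readline()  # skip the header
    for line in f:
        in_q.put(line)
    for _ in range(n_workers):
        in_q.put(None)  # one stop sentinel per worker
    f.close()

def get_users_queued(log, ix_profile, ix_user, n_workers=8):
    in_q, out_q = Queue(maxsize=10000), Queue(maxsize=10000)
    procs = [Process(target=producer, args=(log, in_q, n_workers))]
    procs += [Process(target=line_parser, args=(in_q, out_q, ix_profile, ix_user))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    users = defaultdict(int)
    done = 0
    while done < n_workers:  # reduce in the parent process
        item = out_q.get()
        if item is None:
            done += 1
        else:
            users[item] += 1
    for p in procs:
        p.join()
    return users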

from collections import defaultdict
def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline. 
    # That's fine for now.
    for (i, line) in enumerate(f): 
        if i % 1000000 == 0: print "Line %d" % i # progress notification

        l = line.split('\t')
        if l[ix_profile] != '"7"': # "7" indicates a bad value
            # use list slicing to remove quotes
            users[l[ix_user][1:-1]] += 1 

    f.close()
    return users

I've checked that I'm not I/O-bound by removing everything but the print statement from the for loop. That code ran in 9 seconds, which I'll consider a lower bound for how fast this code can run.

I have a lot of these 5 GB files to process, so even a pretty small improvement in runtime (I know, I can remove the print!) will help. The machine I am running on has 4 cores, so I can't help but wonder if there's a way to get the multithread/multiprocess code to run faster than the code above.

UPDATE:

I rewrote the multiprocessing code as follows:

from multiprocessing import Pool, cpu_count
from collections import defaultdict

def parse(line, ix_profile=10, ix_user=9):
    """ix_profile and ix_user predetermined; hard-coding for expedience."""
    l = line.split('\t')
    if l[ix_profile] != '"7"':
        return l[ix_user][1:-1]

def get_users_mp():
    f = open('20110201.txt')
    h = f.readline() # remove header line
    pool = Pool(processes=cpu_count())
    result_iter = pool.imap_unordered(parse, f, 100)
    users = defaultdict(int)
    for r in result_iter:
        if r is not None:
            users[r] += 1
    return users

It runs in 26 seconds, a 1.85x speedup. Not bad, but with 4 cores, not as much as I had hoped for.
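
When running this as a standalone script, the Pool setup normally goes under a main guard, so that worker processes don't re-execute the module body on platforms that spawn rather than fork (e.g. Windows). A minimal driver might be:

if __name__ == '__main__':
    users = get_users_mp()
    print "Found %d distinct user ids" % len(users)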

Asked May 05 '11 by Jason Sundram


3 Answers

Use regular expressions.

A quick test shows that the expensive part of the process is the call to str.split(): having to construct a list and a bunch of string objects for every line is what costs the time.
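
One way to see this for yourself is an illustrative micro-benchmark (not the test mentioned above; the 20-column line is made up):

import re
import timeit

line = '\t'.join('"%d"' % i for i in range(20)) + '\n'   # made-up 20-column record
pattern = re.compile(r'("[^"]*")\t("[^"]*")\t')           # only has to look at the first two fields

print timeit.timeit(lambda: line.split('\t'), number=1000000)
print timeit.timeit(lambda: pattern.match(line).groups(), number=1000000)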

Firstly, you need to construct a regular expression to match against the line. Something like:

expression = re.compile(r'("[^"]*")\t("[^"]*")\t')

If you call expression.match(line).groups(), you'll get the first two columns extracted as two string objects and you can do logic with those directly.

Now this assumes that the two columns of interest are the first two columns. If not, you'll just have to tweak the regular expression to match the correct columns. Your code checks the header to see where the columns are located. You could generate the regular expression based on that, but I'm going to guess that the columns really are always located at the same place. Just verify that they are still there and use a regular expression on the lines.
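
For example, on a made-up record in the question's format, the two groups come back still wrapped in their quotes:

import re

expression = re.compile(r'("[^"]*")\t("[^"]*")\t')
line = '"7"\t"12345"\t"x"\t"y"\n'        # made-up sample record
print expression.match(line).groups()    # -> ('"7"', '"12345"')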

EDIT

from collections import defaultdict
import re

def get_users(log):
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')

    assert ix_user < ix_profile

This code assumes that the user column comes before the profile column.

    keep_field = r'"([^"]*)"'

This regular expression captures the value of a single column, without the surrounding quotes.

    skip_field = r'"[^"]*"'

This regular expression matches a column but does not capture its value (note the lack of parentheses).

    fields = [skip_field] * len(h)
    fields[ix_profile] = keep_field
    fields[ix_user] = keep_field

Create a list with a pattern for every field, and only capture the ones we care about.

    del fields[max(ix_profile, ix_user)+1:]

Eliminate all the fields after the last one we care about; they take time to match and we don't need them.

    regex = re.compile(r"\t".join(fields))

Actually produce the regex.
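
With the hard-coded positions from the question's update (ix_user=9, ix_profile=10), the pattern handed to re.compile would look roughly like nine skip patterns followed by the two capturing ones:

"[^"]*"\t"[^"]*"\t ... \t"[^"]*"\t"([^"]*)"\t"([^"]*)"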

    users = defaultdict(int)
    for line in f:
        user, profile = regex.match(line).groups()

Pull out the two values, and do the logic

        if profile != "7": # "7" indicates a bad value
            # the capture groups have already stripped the quotes
            users[user] += 1

    f.close()
    return users
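
For example, the function could be driven like this (the file name comes from the question's update; the top-ten report is just an example):

counts = get_users('20110201.txt')
# report the ten most frequently seen user ids
for user, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print "%s\t%d" % (user, n)
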
Answered by Winston Ewert


If you're running Unix or Cygwin, the following little script will produce the frequency of user ids where profile != 7. It should be quick.

Updated with awk to count the user ids

#!/bin/bash

FILENAME="test.txt"

IX_PROFILE=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.type | cut -f1`
IX_USER=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.id | cut -f1`
# Just the userids
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2

# userids counted:
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | sort | uniq -c

# Count using awk..?
sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | awk '{ count[$1]++; } END { for (x in count) { print x "\t" count[x] } }'
Answered by MattH


Seeing that your log file is tab-delimited, you can use the csv module - with a dialect='excel-tab' argument - for a nice performance and readability boost. That is, of course, if you have to use Python instead of the much faster console commands.
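
A minimal sketch of that idea, assuming the layout from the question (get_users_csv is an illustrative name; note that the csv reader strips the surrounding quotes for you, so the bad-value check compares against '7' rather than '"7"'):

import csv
from collections import defaultdict

def get_users_csv(log):
    users = defaultdict(int)
    f = open(log)
    reader = csv.reader(f, dialect='excel-tab')
    header = next(reader)                    # header row comes back already de-quoted
    ix_profile = header.index('profile.type')
    ix_user = header.index('profile.id')
    for row in reader:
        if row[ix_profile] != '7':           # csv has already stripped the quotes
            users[row[ix_user]] += 1
    f.close()
    return users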

Answered by ktdrv