The runtime of this code on my laptop for a 4.2 GB input file is 48 seconds. The input file is tab-delimited, with each value appearing in quotes. Each record ends with a newline, e.g. '"val1"\t"val2"\t"val3"\t..."valn"\n'
I've tried using multiprocessing with 10 processes: one to queue up the lines, 8 to parse individual lines and fill an output queue, and one to reduce the output queue into the defaultdict shown below, but that code took 300 seconds to run, over 6 times longer than the following:
from collections import defaultdict

def get_users(log):
    users = defaultdict(int)
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # If either ix_* is the last field in h, it will include a newline.
    # That's fine for now.
    for (i, line) in enumerate(f):
        if i % 1000000 == 0: print "Line %d" % i  # progress notification
        l = line.split('\t')
        if l[ix_profile] != '"7"':  # "7" indicates a bad value
            # use list slicing to remove quotes
            users[l[ix_user][1:-1]] += 1
    f.close()
    return users
I've checked that I'm not I/O-bound by removing everything but the print statement from the for loop. That code ran in 9 seconds, which I'll consider a lower bound for how fast this code can run.
I have a lot of these 5 GB files to process, so even a pretty small improvement in runtime (I know, I can remove the print!) will help. The machine I am running on has 4 cores, so I can't help but wonder if there's a way to get the multithread/multiprocess code to run faster than the code above.
UPDATE:
I rewrote the multiprocessing code as follows:
from multiprocessing import Pool, cpu_count
from collections import defaultdict

def parse(line, ix_profile=10, ix_user=9):
    """ix_profile and ix_user predetermined; hard-coding for expedience."""
    l = line.split('\t')
    if l[ix_profile] != '"7"':
        return l[ix_user][1:-1]

def get_users_mp():
    f = open('20110201.txt')
    h = f.readline()  # remove header line
    pool = Pool(processes=cpu_count())
    result_iter = pool.imap_unordered(parse, f, 100)
    users = defaultdict(int)
    for r in result_iter:
        if r is not None:
            users[r] += 1
    return users
It runs in 26 seconds, a 1.85x speedup. Not bad, but with 4 cores, not as much as I had hoped for.
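One thing I haven't tried yet is cutting down on the inter-process traffic: have each worker count a whole chunk of lines and return a small dict, so only per-chunk counts cross the process boundary instead of one result per line. A rough, untested sketch, reusing the same hard-coded column indices and filename as above:

from multiprocessing import Pool, cpu_count
from collections import defaultdict
from itertools import islice

def count_chunk(lines, ix_profile=10, ix_user=9):
    """Count users in one chunk of lines and return a small dict."""
    counts = defaultdict(int)
    for line in lines:
        l = line.split('\t')
        if l[ix_profile] != '"7"':
            counts[l[ix_user][1:-1]] += 1
    return counts

def read_chunks(f, size=100000):
    """Yield lists of up to `size` lines until the file is exhausted."""
    while True:
        chunk = list(islice(f, size))
        if not chunk:
            return
        yield chunk

def get_users_chunked(log='20110201.txt'):
    f = open(log)
    f.readline()  # skip header line
    pool = Pool(processes=cpu_count())
    users = defaultdict(int)
    # Each worker returns one dict per chunk; merge them here.
    for partial in pool.imap_unordered(count_chunk, read_chunks(f)):
        for user, n in partial.items():
            users[user] += n
    f.close()
    return users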
Use regular expressions.
Testing shows that the expensive part of the process is the call to str.split(): constructing a list and a bunch of string objects for every line is probably what costs the most.
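If you want to check that on your own data, a quick timeit comparison along these lines will do (a sketch; the sample line below is made up and only approximates your record layout):

import re
import timeit

# A made-up sample record: 20 quoted, tab-separated fields.
line = '\t'.join('"val%d"' % i for i in range(20)) + '\n'
expression = re.compile(r'("[^"]*")\t("[^"]*")')

split_time = timeit.timeit(lambda: line.split('\t'), number=1000000)
match_time = timeit.timeit(lambda: expression.match(line).groups(), number=1000000)
print "split: %.2fs  regex: %.2fs" % (split_time, match_time)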
Firstly, you need to construct a regular expression to match against the line. Something like:
expression = re.compile(r'("[^"]*")\t("[^"]*")\t')
If you call expression.match(line).groups(), you'll get the first two columns extracted as two string objects and you can do logic with those directly.
Now this assumes that the two columns of interest are the first two columns; if not, you'll have to tweak the regular expression to match the correct columns. Your code checks the header to see where the columns are located, so you can generate the regular expression from that, but I'm going to guess that the columns are really always in the same place. Just verify that they are still there, and use the regular expression on the lines.
EDIT
from collections import defaultdict
import re

def get_users(log):
    f = open(log)
    # Read header line
    h = f.readline().strip().replace('"', '').split('\t')
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    # This code assumes that user comes before profile
    assert ix_user < ix_profile
    # This regular expression will capture a single column
    keep_field = r'"([^"]*)"'
    # This regular expression will match the column, but not capture the
    # result (note the lack of parentheses)
    skip_field = r'"[^"]*"'
    # Create a pattern for every field, and only keep the ones we care about
    fields = [skip_field] * len(h)
    fields[ix_profile] = keep_field
    fields[ix_user] = keep_field
    # Eliminate all the fields after the ones we care about
    # (they take time to match, and we don't care about them)
    del fields[max(ix_profile, ix_user) + 1:]
    # Actually produce the regex
    regex = re.compile(r"\t".join(fields))
    users = defaultdict(int)
    for line in f:
        # Pull out the two values and do the logic
        user, profile = regex.match(line).groups()
        if profile != "7":  # "7" indicates a bad value
            users[user] += 1
    f.close()
    return users
If you're running Unix or Cygwin, the following little script will produce the frequency of user ids where profile != 7. It should be quick.
Updated with awk to count the user ids:
#!/bin/bash
FILENAME="test.txt"
IX_PROFILE=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.type | cut -f1`
IX_USER=`head -1 ${FILENAME} | sed -e 's/\t/\n/g' | nl -w 1 | grep profile.id | cut -f1`
# Just the userids
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2
# userids counted:
# sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | sort | uniq -c
# Count using awk..?
sed 1d ${FILENAME} | cut -f${IX_PROFILE},${IX_USER} | grep -v \"7\" | cut -f2 | awk '{ count[$1]++; } END { for (x in count) { print x "\t" count[x] } }'
Seeing that your log file is tab-delimited, you can use the csv module - with a dialect='excel-tab' argument - for a nice performance and readability boost. That is, of course, if you have to use Python instead of the much faster console commands.
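For example, a minimal sketch (untested) that keeps the same header lookup and the '7' filter from the question:

import csv
from collections import defaultdict

def get_users_csv(log):
    users = defaultdict(int)
    f = open(log, 'rb')
    reader = csv.reader(f, dialect='excel-tab')
    h = next(reader)  # header row; the csv module strips the quotes for us
    ix_profile = h.index('profile.type')
    ix_user = h.index('profile.id')
    for row in reader:
        if row[ix_profile] != '7':  # quotes already removed, compare the bare value
            users[row[ix_user]] += 1
    f.close()
    return users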