Optimized processing of very large files

My task is relatively simple: for each line in an input file, test whether the line satisfies a given set of conditions, and if so, write specific columns of that line to a new file. I've written a Python script that does this, but I'd like some help with 1) improving speed, 2) the best way to work in terms of column names (since column numbers can vary from file to file), and 3) the best way to specify my filtering conditions and desired output columns.

1) The files I work with contain photometry for astronomical images. Each file is around 1e6 lines by 150 columns of floats, typically over 1GB in size. I have an old AWK script that will process files like this in about a minute; my Python script takes between 5 and 7 minutes. I often need to tweak the filtering conditions and rerun several times until the output file is what I want, so speed is definitely desirable. I've found that the for loop is plenty fast; it's what I do inside the loop that slows things down. Using itemgetter to pick out just the columns I want was a big improvement over reading each entire line into memory, but I'm unsure what else I can do to increase speed. Can this ever be as fast as AWK?

2) I'd like to work in terms of column names instead of column numbers, since the column number of a particular quantity (photon counts, background, signal-to-noise, etc.) can change between files. In my AWK script, I always need to check that the column numbers are correct wherever conditions and output columns are specified, even if the filtering and output apply to the same quantities. My solution in Python has been to create a dictionary that assigns a column number to each quantity. When a file has different columns, I only need to specify a new dictionary. Perhaps there is a better way to do this?

3) Ideally, I would only need to specify the names of the input and output files, the filtering conditions, and the desired output columns, all at the top of my script, so I wouldn't need to go searching through the code just to tweak something. My main issue has been with undefined variables. For example, a typical condition is 'SNR > 4', but 'SNR' (signal-to-noise) isn't actually assigned a value until lines start being read from the photometry file. My solution has been to use a combination of strings and eval/exec. Again, maybe there is a better way?

I'm not at all trained in computer science (I'm a grad student in astronomy); I typically just hack something together and debug until it works. However, optimization with regard to my three points above has become extremely important for my research. I apologize for the lengthy post, but I felt that the details would be helpful. Any and all advice you have for me, in addition to just cleaning things up/coding style, would be greatly appreciated.

Thanks so much, Jake

#! /usr/bin/env python2.6

from operator import itemgetter


infile = 'ugc4305_1.phot'
outfile = 'ugc4305_1_filt.phot'

# names must be keys in the columns dictionary below
conditions = 'OBJ <= 2 and SNR1 > 4 and SNR2 > 4 and FLAG1 < 8 and FLAG2 < 8 and (SHARP1 + SHARP2)**2 < 0.075 and (CROWD1 + CROWD2) < 0.1'

# should contain all quantities used in conditions
input = 'OBJ, SNR1, SNR2, FLAG1, FLAG2, SHARP1, SHARP2, CROWD1, CROWD2'

output = 'X, Y, OBJ, COUNTS1, BG1, ACS1, ERR1, CHI1, SNR1, SHARP1, ROUND1, CROWD1, FLAG1, COUNTS2, BG2, ACS2, ERR2, CHI2, SNR2, SHARP2, ROUND2, CROWD2, FLAG2'

# dictionary of column numbers for the more important quantities
columns = dict(EXT=0, CHIP=1, X=2, Y=3, CHI_GL=4, SNR_GL=5, SHARP_GL=6, ROUND_GL=7, MAJAX_GL=8, CROWD_GL=9, OBJ=10, COUNTS1=11, BG1=12, ACS1=13, STD1=14, ERR1=15, CHI1=16, SNR1=17, SHARP1=18, ROUND1=19, CROWD1=20, FWHM1=21, ELLIP1=22, PSFA1=23, PSFB1=24, PSFC1=25, FLAG1=26, COUNTS2=27, BG2=28, ACS2=29, STD2=30, ERR2=31, CHI2=32, SNR2=33, SHARP2=34, ROUND2=35, CROWD2=36, FWHM2=37, ELLIP2=38, PSFA2=39, PSFB2=40, PSFC2=41, FLAG2=42)



f = open(infile)
g = open(outfile, 'w')


# make string that extracts values for testing
input_items = []
for i in input.replace(',', ' ').split():
    input_items.append(columns[i])
input_items = ', '.join(str(i) for i in input_items)

var_assign = '%s = [eval(i) for i in itemgetter(%s)(line.split())]' % (input, input_items) 


# make string that specifies values for writing
output_items = []
for i in output.replace(',', ' ').split():
    output_items.append(columns[i])
output_items = ', '.join(str(i) for i in output_items)

output_values = 'itemgetter(%s)(line.split())' % output_items


# make string that specifies format for writing
string_format = []
for i in output.replace(',', ' ').split():
    string_format.append('%s')
string_format = ' '.join(string_format)+'\n'


# main loop
for line in f:
    exec(var_assign)
    if eval(conditions):
        g.write(string_format % tuple(eval(output_values)))
f.close()
g.close()
asked Feb 25 '11 by Jake


2 Answers

I don't think you mentioned it, but it looks like your data is in a delimited (CSV-like) format. You might get a lot out of using csv.DictReader: you can iterate over the file one line at a time (avoiding loading the whole thing into memory) and refer to columns by their names.
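
A minimal sketch of how that could look, reusing the columns dict, infile, and outfile from the script above. The single-space delimiter is an assumption (csv requires a one-character delimiter), with skipinitialspace=True to tolerate runs of spaces; adjust to match the real files.

import csv

# build fieldnames in file order from the question's name -> index dict;
# any columns past the named ones are collected under DictReader's restkey
fieldnames = sorted(columns, key=columns.get)

f = open(infile)
g = open(outfile, 'w')
reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=' ',
                        skipinitialspace=True)
for row in reader:
    # values arrive as strings, so convert before comparing (sample cut)
    if float(row['SNR1']) > 4 and float(row['SNR2']) > 4:
        g.write('%s %s\n' % (row['X'], row['Y']))
f.close()
g.close()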

You should also take a look at cProfile, Python's profiler, if you haven't already. It will tell you which parts of your program are taking the most time to execute.
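
For example, you can profile the whole script from the command line (the script name here is hypothetical):

python -m cProfile -s cumulative filter_phot.py

The -s flag sorts the report; sorting by cumulative time makes the hot spots in the main loop easy to spot.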

answered by nmichaels


My first step here would be to get rid of the exec() and eval() calls. Every time you eval a string, it has to be compiled and then executed, adding overhead to every line of your file. Not to mention that eval tends to lead to messy, hard-to-debug code and should generally be avoided.

You can start refactoring by putting your logic into small, easily understandable functions. For example, you can replace eval(conditions) with a function:

def conditions(d):
    # OBJ, SNR1, etc. are column indices, set once at the top of the script
    return (d[OBJ] <= 2 and
            d[SNR1] > 4 and
            d[SNR2] > 4 and
            d[FLAG1] < 8 and
            d[FLAG2] < 8 and
            (d[SHARP1] + d[SHARP2])**2 < 0.075 and
            (d[CROWD1] + d[CROWD2]) < 0.1)

Tip: if some of your conditions are more likely to fail, put them first; 'and' short-circuits, so Python will skip evaluating the rest.

I would also get rid of the dictionary of column names and simply set a handful of index variables at the top of the file, then refer to columns with line[COLNAME]. This simplifies parts like the conditions function, and you can still refer to columns by name without having to assign each variable on every line.
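
For instance, here is a rough sketch of how that fits together with the conditions function above. The index values are copied from the question's columns dictionary; converting every field to float up front is an assumption, trading a little speed for correct numeric comparisons.

# column indices, copied from the question's columns dict
OBJ, SNR1, SNR2 = 10, 17, 33
FLAG1, FLAG2 = 26, 42
SHARP1, SHARP2 = 18, 34
CROWD1, CROWD2 = 20, 36

f = open('ugc4305_1.phot')
g = open('ugc4305_1_filt.phot', 'w')
for line in f:
    d = [float(x) for x in line.split()]  # floats, so comparisons behave
    if conditions(d):
        g.write(line)  # or format and write just the desired columns
f.close()
g.close()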

answered by JimB