Is there an implementation of Hadley's ddply for python?

Tags: python, r, plyr

I find Hadley's plyr package for R extremely helpful; it's a great DSL for transforming data. The problem that it solves is so common that I also run into it in other use cases, when manipulating data not in R but in other programming languages.

Does anyone know if there is a module that does a similar thing for Python? Something like:

def ddply(rows, *cols, op=lambda group_rows: group_rows):
    """group rows by cols, then apply the function op to each group
       and return the results aggregating all groups
       rows is a dict or list of values read by csv.reader or csv.DictReader"""
    pass

It shouldn't be too difficult to implement, but it would be great if it already existed. If I implemented it myself, I'd use itertools.groupby to group by cols, apply the op function to each group, and then use itertools.chain to chain the results together. Is there a better solution?
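One detail of that plan is worth checking before building on it: itertools.groupby only merges rows that are adjacent, so the rows have to be sorted by the grouping key first. A minimal sketch of the difference, using made-up sample rows and a hypothetical by_city key function:

```python
import itertools

# Made-up sample data, shaped like rows from csv.DictReader.
rows = [
    {"city": "Oslo", "sales": 3},
    {"city": "Rome", "sales": 5},
    {"city": "Oslo", "sales": 4},
]

def by_city(row):
    return row["city"]

# Unsorted: the two "Oslo" rows are not adjacent, so they land in separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(rows, key=by_city)]

# Sorted by the same key: each key now appears exactly once.
sorted_keys = [key for key, _ in
               itertools.groupby(sorted(rows, key=by_city), key=by_city)]

print(unsorted_keys)  # ['Oslo', 'Rome', 'Oslo']
print(sorted_keys)    # ['Oslo', 'Rome']
```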

asked Jun 22 '11 at 01:06 by rafalotufo


1 Answer

This is the implementation I drafted up:

import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    """Group rows by cols, then apply the function op to each group.

    rows is a list of sequences or dicts keyed by column name (as read
    from csv.reader or csv.DictReader). op must return an iterable."""
    def group_key(row):
        # Build a tuple: a generator expression cannot be sorted or compared.
        return tuple(row[col] for col in cols)
    # groupby only merges adjacent rows, so sort by the same key first.
    rows = sorted(rows, key=group_key)
    return itertools.chain.from_iterable(
        op(group_rows) for k, group_rows in itertools.groupby(rows, key=group_key))

Another step would be to have a set of predefined functions that could be applied as op, like sum and other utility functions.
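As a rough sketch of what such a predefined op could look like, the block below builds ops from ordinary reducers like sum or max. The summarise helper, its parameters, and the sample rows are hypothetical names invented for illustration, not part of plyr or any existing module. The ddply function is repeated (with the group key built as a tuple and each group materialised as a list) so that the example runs on its own.

```python
import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    """Group rows by cols, apply op to each group, and chain the results."""
    def group_key(row):
        return tuple(row[col] for col in cols)
    rows = sorted(rows, key=group_key)
    return itertools.chain.from_iterable(
        op(list(group)) for _, group in itertools.groupby(rows, key=group_key))

def summarise(keep, col, func):
    """Hypothetical helper: build an op that reduces one column per group.

    keep: columns copied from the first row of the group
          (usually the grouping columns)
    col:  column whose values are passed to func
    """
    def op(group_rows):
        result = {k: group_rows[0][k] for k in keep}
        result[col] = func(row[col] for row in group_rows)
        return [result]  # ddply chains iterables, so return a one-item list
    return op

# Made-up sample data, shaped like rows from csv.DictReader.
rows = [
    {"city": "Oslo", "sales": 3},
    {"city": "Rome", "sales": 5},
    {"city": "Oslo", "sales": 4},
]

totals = list(ddply(rows, ["city"], summarise(["city"], "sales", sum)))
print(totals)  # [{'city': 'Oslo', 'sales': 7}, {'city': 'Rome', 'sales': 5}]
```

Because the op returns a list with a single dict, swapping sum for max or min gives other per-group summaries without touching ddply itself.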

answered Sep 18 '22 at 06:09 by rafalotufo