I find Hadley's plyr package for R extremely helpful; it's a great DSL for transforming data. The problem it solves is so common that I run into it in other settings too, not when manipulating data in R but when working in other programming languages.
Does anyone know whether a module exists that does a similar thing for Python? Something like:
    def ddply(rows, *cols, op=lambda group_rows: group_rows):
        """group rows by cols, then apply the function op to each group
        and return the results aggregating all groups
        rows is a dict or list of values read by csv.reader or csv.DictReader"""
        pass
It shouldn't be too difficult to implement, but it would be great if it already existed. If I implemented it myself, I'd use `itertools.groupby` to group by `cols`, then apply the `op` function to each group, then use `itertools.chain` to chain it all up. Is there a better solution?
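To make the two building blocks concrete, here is a minimal sketch on made-up sample rows (the data and the `by_city` helper are illustrative only). It shows that `itertools.groupby` merges only adjacent rows with equal keys, so the input has to be sorted by the grouping key first, and that `itertools.chain.from_iterable` flattens the per-group results back into a single sequence.

```python
import itertools

# Hypothetical sample rows: (city, amount)
rows = [("Oslo", 3), ("Rome", 5), ("Oslo", 4), ("Rome", 1)]

def by_city(row):
    # Grouping key: the first field of each row
    return row[0]

# groupby only merges *adjacent* rows with equal keys, so unsorted input
# produces one group per run instead of one group per city.
unsorted_keys = [key for key, _ in itertools.groupby(rows, key=by_city)]

# Sorting by the same key first gives exactly one group per city.
groups = [
    (key, list(group))
    for key, group in itertools.groupby(sorted(rows, key=by_city), key=by_city)
]

# chain.from_iterable flattens the per-group lists back into one list.
flattened = list(itertools.chain.from_iterable(group for _, group in groups))
```

Without the sort, the sample above would be split into four groups rather than two, which is why any implementation along these lines needs a sorting step before grouping.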
This is the implementation I drafted up:
    import itertools

    def ddply(rows, cols, op=lambda group_rows: group_rows):
        """group rows by cols, then apply the function op to each group
        rows is a list of rows, each a list of values or a dict keyed by
        col name (as read by csv.reader or csv.DictReader)"""
        def group_key(row):
            # return a tuple: a generator expression cannot be sorted or compared by groupby
            return tuple(row[col] for col in cols)
        # groupby only merges adjacent rows, so sort by the same key first
        rows = sorted(rows, key=group_key)
        return itertools.chain.from_iterable(
            op(group_rows) for k, group_rows in itertools.groupby(rows, key=group_key))
Another step would be to have a set of predefined functions that could be applied as `op`, like `sum` and other utility functions.
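One possible shape for such predefined functions is sketched below. It reuses the drafted `ddply`, with `group_key` returning a tuple (an adjustment made here so that keys can be sorted and compared). The `summarise` helper, its keyword format, and the sample CSV data are all hypothetical, and converting values with `float` is an assumption that the aggregated columns are numeric.

```python
import csv
import io
import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    """Group rows by cols, apply op to each group, and chain the results."""
    def group_key(row):
        return tuple(row[col] for col in cols)
    rows = sorted(rows, key=group_key)
    return itertools.chain.from_iterable(
        op(group_rows) for _, group_rows in itertools.groupby(rows, key=group_key))

def summarise(cols, **aggregates):
    """Hypothetical helper: build an op that emits one summary row per group.

    Each keyword maps an output name to a (source column, function) pair.
    """
    def op(group_rows):
        group_rows = list(group_rows)  # groupby yields a one-shot iterator
        out = {col: group_rows[0][col] for col in cols}  # keep the grouping columns
        for name, (source, func) in aggregates.items():
            # Assumes the source column holds numeric text, as csv readers return strings
            out[name] = func(float(row[source]) for row in group_rows)
        return [out]  # a list, so chain.from_iterable can flatten it
    return op

# Made-up CSV data standing in for a real file
data = io.StringIO("city,item,price\nOslo,a,3\nRome,b,5\nOslo,c,4\n")
rows = list(csv.DictReader(data))

result = list(ddply(rows, ["city"],
                    summarise(["city"], total=("price", sum), top=("price", max))))
```

Returning a one-element list from `op` keeps the contract of the draft, where every `op` returns an iterable of rows, so builtins such as `sum`, `min`, and `max` can be plugged in without changing `ddply` itself.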