Suppose I have a data frame with columns x, a, b, c. I would like to aggregate over a, b, c to get a value y from a list of x via a function myfun, then duplicate that value for all rows within each window/partition.
In R with data.table this is just one line: dt[,y:=myfun(x),by=list(a,b,c)].
In Python the only way I can think of is to do something like this:
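import itertools
from operator import attrgetter

# To simulate rows in a data frame
class Record:
    def __init__(self, x, a, b, c):
        self.x = x
        self.a = a
        self.b = b
        self.c = c

# Assume we have a list of Record as df
mykey = attrgetter('a', 'b', 'c')
# groupby only merges adjacent items, so sort by the same key first
for key, group_iter in itertools.groupby(sorted(df, key=mykey), key=mykey):
    group = list(group_iter)
    y = myfun(r.x for r in group)
    # Write the group's value back onto every record in the group
    for r in group:
        r.y = y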
Although the logic is quite clear, I am not 100% happy with it. Is there a better approach?
I am not very familiar with pandas. Does it help in such a case?
Side question: is there a category that my problem belongs to? Aggregation? Partition? Window? This pattern comes up so frequently in data analysis that there must be an existing name for it.
Use a DataFrame and its groupby method from pandas:
import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})
df.groupby('a').apply(myfun)
The exact usage depends on how you wrote your function myfun. Where the column used is static (e.g. always x), I write myfun to take the full DataFrame and subset inside the function. However, if your function is written to accept a vector (or a pandas Series), you can also select the column and apply your function to it:
df.groupby('a')['x'].apply(myfun)
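If the goal is the one from the question, where every row receives its group's value (like data.table's `:=`), pandas also offers `transform` on a grouped column. The following is a minimal sketch, not the only way to do it: the `myfun` body and the sample data are made up for illustration, and it assumes `myfun` takes a Series and returns a single value.

```python
import pandas as pd

# Hypothetical stand-in for the question's myfun:
# takes a Series of x values and returns one scalar.
def myfun(values):
    return values.sum()

# Illustrative data with the question's columns x, a, b, c
df = pd.DataFrame({
    'a': ['p', 'p', 'q', 'q'],
    'b': [1, 1, 1, 2],
    'c': ['u', 'u', 'u', 'u'],
    'x': [10, 20, 30, 40],
})

# transform returns a result aligned with the original rows, so each row
# gets the value computed for its (a, b, c) group.
df['y'] = df.groupby(['a', 'b', 'c'])['x'].transform(myfun)
```

Unlike apply, which here would give one value per group, the assignment above keeps the original row count and simply adds the column y.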
FWIW, it is also often convenient to return a pd.Series object when you're using groupby.
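As a rough illustration of the pd.Series point: when the function passed to apply returns a Series with the same labels for every group, pandas typically assembles the results into a DataFrame with one row per group and one column per label. The function name `summarize` and the labels `total` and `mean` below are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

# Hypothetical summary function: receives each group's rows as a DataFrame
# and returns several named values at once.
def summarize(group):
    return pd.Series({'total': group['x'].sum(),
                      'mean': group['x'].mean()})

# Selecting [['x']] keeps each group a DataFrame containing only column x.
result = df.groupby('a')[['x']].apply(summarize)
```

Here result is indexed by the values of a, with total and mean as columns.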
To answer your side question, this is known as the split-apply-combine strategy of data processing. See here for more info.
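To make the three steps of the name concrete, here is a small sketch in plain Python with no pandas involved. The rows, the single grouping key a, and the `myfun` body are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for myfun: reduces a list of x values to one value.
def myfun(values):
    return sum(values)

rows = [
    {'a': 'p', 'x': 1},
    {'a': 'q', 'x': 2},
    {'a': 'p', 'x': 3},
    {'a': 'q', 'x': 4},
]

# Split: collect the x values belonging to each key.
groups = defaultdict(list)
for row in rows:
    groups[row['a']].append(row['x'])

# Apply: run the function once per group.
results = {key: myfun(values) for key, values in groups.items()}

# Combine: attach each group's result back onto its rows.
for row in rows:
    row['y'] = results[row['a']]
```

The groupby calls in pandas do the same split and combine bookkeeping for you, so only the apply step has to be written by hand.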