What is the Pythonic way to do R's dt[,y:=myfun(x),by=list(a,b,c)]?

Question

Suppose I have a data frame with columns x, a, b, c. I would like to aggregate over a, b, c to get a value y from the list of x values in each group via a function myfun, then duplicate that value for all rows within each window/partition.

In R with data.table this is just one line: dt[,y:=myfun(x),by=list(a,b,c)].

In Python the only way I can think of is something like this:

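 import itertools
 from operator import attrgetter

 # To simulate rows in a data frame
 class Record:
     def __init__(self, x, a, b, c):
         self.x = x
         self.a = a
         self.b = b
         self.c = c

 # Assume df is a list of Record objects and myfun is defined elsewhere
 mykey = attrgetter('a', 'b', 'c')
 # groupby only merges adjacent items, so sort by the same key first
 for key, group_iter in itertools.groupby(sorted(df, key=mykey), key=mykey):
     group = list(group_iter)  # materialize: the group is iterated twice
     y = myfun(rec.x for rec in group)
     for rec in group:
         rec.y = y  # copy the aggregate back onto every row in the group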

Although the logic is quite clear, I am not 100% happy with it. Is there any better approach?

I am not very familiar with pandas. Does it help in such a case?

Side question: is there a category that my problem belongs to? aggregation? partition? window? This pattern happens so frequently in data analysis, there must be an existing name for it.

Justin · Accepted Answer

Use a DataFrame and its groupby method from pandas:

import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

df.groupby('a').apply(myfun)

The exact usage depends on how you wrote your function myfun. Where the column used is static (e.g. always x), I write myfun to take the full DataFrame and subset inside the function. However, if your function is written to accept a vector (or a pandas Series), you can also select the column and apply your function to it:

df.groupby('a')['x'].apply(myfun)
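Note that apply returns one result per group, whereas the data.table := form in the question writes the group's value back onto every row. A minimal sketch of that broadcast step, assuming pandas' groupby transform is acceptable here; the myfun below and the sample data are hypothetical stand-ins, not the asker's:

```python
import pandas as pd

def myfun(values):
    # Hypothetical stand-in for the question's myfun:
    # reduces a Series of x values to a single scalar.
    return values.max() - values.min()

# Illustrative data with the question's column names.
df = pd.DataFrame({
    'a': ['p', 'p', 'q', 'q', 'p'],
    'b': [1, 1, 1, 1, 2],
    'c': ['u', 'u', 'u', 'u', 'u'],
    'x': [10, 13, 5, 9, 7],
})

# transform calls myfun once per (a, b, c) group and repeats the
# scalar result on every row of that group, keeping the original index.
df['y'] = df.groupby(['a', 'b', 'c'])['x'].transform(myfun)
```

The result has the same length and index as df, so it can be assigned directly as a new column.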

FWIW, it is also often convenient to return a pd.Series object when you're using groupby.
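As a rough illustration of the pd.Series point, here is a sketch where the applied function returns a labelled Series, so each label becomes a column in the result. The function name summarize and its two statistics are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

def summarize(group):
    # Hypothetical per-group function: returns several named values at once.
    return pd.Series({'total': group['x'].sum(),
                      'mean': group['x'].mean()})

# Selecting [['x']] passes only that column to summarize.
# The result has one row per value of 'a' and one column per Series label.
summary = df.groupby('a')[['x']].apply(summarize)
```

Here summary is a DataFrame indexed by a with columns total and mean.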

To answer your side question, this is known as the split-apply-combine strategy of data processing. See here for more info.
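To make the three steps concrete, here is a sketch that spells them out by hand on the same toy data. It only illustrates the idea; in practice groupby does all of this for you, and sum is just an example aggregate:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

# Split: one sub-DataFrame per distinct value of 'a'.
pieces = {key: group for key, group in df.groupby('a')}

# Apply: compute something on each piece independently.
results = {key: group['x'].sum() for key, group in pieces.items()}

# Combine: gather the per-group results into one object.
combined = pd.Series(results, name='x')
combined.index.name = 'a'
```

The one-liner df.groupby('a')['x'].sum() performs the same split, apply and combine steps internally.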

Tags: python, pandas, r

colinfang
