Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is pythonic way to do dt[,y:=myfun(x),by=list(a,b,c)] in R?

Tags:

python

pandas

r

Suppose I have a data frame which have column x, a, b, c And I would like to aggregate over a, b, c to get a value y from a list of x via a function myfun, then duplicate the value for all rows within each window/partition.

In R in data.table this is just 1 line: dt[,y:=myfun(x),by=list(a,b,c)].

In Python the only way I think of is do something like this:

 # To simulate rows in a data frame
 class Record:
      def __init__(self, x, a, b, c):
           self.x = x
           self.a = a
           self.b = b
           self.c = c

 # Assume we have a list of Record as df
 mykey = attrgetter('a', 'b', 'c')
 for key, group_iter in itertools.groupby(sorted(df, key=mykey), key=mykey):
     group = list(group_iter)
     y = myfun(x.x for x in group)
     for x in group:
         x.y = y

Although the logic is quite clear, I am not 100% happy with it. Is there any better approach?

I am not very familiar with pandas. Does it help in such case?

Side question: is there a category that my problem belongs to? aggregation? partition? window? This pattern happens so frequently in data analysis, there must be an existing name for it.

like image 246
colinfang Avatar asked Dec 06 '13 20:12

colinfang


1 Answers

Use a DataFrame and its groupby method from pandas:

import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'x', 'y'],
                   'x': [1, 2, 3, 4]})

df.groupby('a').apply(myfun)

The exact usage depends on how you wrote your function myfun. Where the column used is static (e.g. always x) I write myfun to take the full DataFrame and subset inside the function. However if your function is written to accept a vector (or a pandas Series), you can also select the column and apply your function to it:

df.groupby('a')['x'].apply(myfun)

FWIW, it is also often convenient to return a pd.Series object when you're using groupby.


To answer your side question, this is known as the split-apply-combine strategy of data processing. See here for more info.

like image 198
Justin Avatar answered Oct 26 '22 11:10

Justin