Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a better/more efficient way to do this (vectorised)? Very slow performance with Pandas apply

So in R, I'd use an optimized apply function for this, but I've read now that the Panda's apply function is an abstracted loop and is perhaps even slower than one, and it shows in the performance. On my machine, it took 30 mins to process 60k rows.

So essentially I'm looking to calculate a moving average based on a dataset with different groups on which I need to calculate a moving average. There are a lot of these groups. So I essentially first have to subset the dataset on a row/cell-wise basis, and only then calculate the moving average.

So I'm trying to come up with a vectorised solution to this but can't seem to figure how you'd go about subsetting the dataframe within a vectorised approach.

My current solution makes use of an apply function which is really easy to understand and maintain:

df['SMA'] = df.apply(SMA, axis=1)

def SMA(row):
     Subset = df[(df['group']==row['group'])&(df['t']<=row['t'])].reset_index()
     Subset2 = Subset[len(Subset.index)-(2):len(Subset.index)]
     return df['val'].mean()

This is my expected output (which I am currently getting but just very very slowly):

This is the dataframe and this example I want the moving average over two timepoints, "t" in this example:

t   group   val moving average
1   A        1  NA
2   A        2  1.5
3   A        3  2.5
1   B        4  NA
2   B        5  4.5
3   B        6  5.5
1   C        7  NA
2   C        8  7.5
3   C        9  8.5
like image 358
Topographical Avatar asked Jun 09 '18 10:06

Topographical


1 Answers

This kind of operation (splitting into groups) is handled by the .groupby method in pandas. If we take care to set the index to time, it also handles giving us the right output with a time index back.

Here's an example which does basically the same as your code:

df = pandas.DataFrame(
   [[1, 'A', 1], 
    [2, 'A', 2], 
    [3, 'A', 3], 
    [1, 'B', 4], 
    [2, 'B', 5], 
    [3, 'B', 6], 
    [1, 'C', 7], 
    [2, 'C', 8], 
    [3, 'C', 9]],
    columns=['t', 'group', 'val'])

df = df.set_index('t')
moving_avg = df.groupby('group').rolling(2).mean()

moving_avg is now a new dataframe. Note that because I set the index to be t in the first part, it's handled correctly in the grouping and rolling averages:

         val
group t     
A     1  NaN
      2  1.5
      3  2.5
B     1  NaN
      2  4.5
      3  5.5
C     1  NaN
      2  7.5
      3  8.5
like image 113
chthonicdaemon Avatar answered Nov 15 '22 22:11

chthonicdaemon