Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement non overlapping rolling functionality on MultiIndex DataFrame

So far I've found this question but it doesn't solve my problem due to the facts that:

  1. I have a MultiIndex DataFrame
  2. The inner level has different amount of data for each outer level, thus I can't use len()

I have the following DataFrame

Outer Inner    Value        
  A     1     2.000000
  A     2     4.000000
  A     3     6.000000
  A     4     8.000000
  B     1     3.000000
  B     2     6.000000
  B     3     9.000000
  B     4     12.000000
  B     5     15.000000

I want to sum the last two values for each outer in a non-overlapping manner. So for A I want to sum inner's 3 + 4, 1 + 2. For B I want to sum inner's 4 + 5, 2 + 3. Note that the pairwise sum is supposed to start from the last value. Resulting in

Outer Inner    Value        
  A     2     6.000000
  A     4    14.000000
  B     3    15.000000
  B     5    27.000000
like image 646
Pedro Braz Avatar asked Oct 29 '22 23:10

Pedro Braz


1 Answers

Groupby with custom resample function

You will most likely need custom resampling to do this. It is a little hacky but might work.

  1. Remove all MulitIndexing to deal with just regular column groupby()s
  2. groupby() 'Outer' and .apply() a custom function to each group
  3. The custom function takes a group
    1. Determine the even length of the group
    2. Select that length backwards
    3. Turn index into seconds
    4. Resample the DataFrame every two samples by resample(...).sum()
    5. Resample the Inner column every two by resample(...).last() to preserve original index numbers
    6. Convert index back to 'Inner'
  4. Even though we removed the MultiIndex, a MultiIndex is still returned by groupby(...).apply()

Note: There is an issue with rolling, as it slides thru the values instead of stepping thru the values (in a non-overlapping method). Using resample allows this. Resample is time based the index needs to be represented as seconds.

Example

import math
import pandas as pd

df = pd.DataFrame({
    'Outer': ['A','A','A','A','B','B','B','B','B'],
    'Inner': [1,2,3,4,1,2,3,4,5],
    'Value': [2.00,4.00,6.00,8.00,3.00,6.00,9.00,12.00,15.00]
})

def f(g):
    even_length = int(2.0 * math.floor(len(g) / 2.0))
    every_two_backwards = g.iloc[-even_length:]
    every_two_backwards.index = pd.TimedeltaIndex(every_two_backwards.index * 1000000000.0)
    resample_via_sum = every_two_backwards.resample('2s').sum().dropna()
    resample_via_sum['Inner'] = every_two_backwards.resample('2s').last()
    resample_via_sum = resample_via_sum.set_index('Inner')

    return resample_via_sum

resampled_df = df.groupby(['Outer']).apply(f)

print resampled_df

             Value
Outer Inner       
A     2.0      6.0
      4.0     14.0
B     3.0     15.0
      5.0     27.0
like image 103
tmthydvnprt Avatar answered Nov 15 '22 05:11

tmthydvnprt