Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to downsample a pandas dataframe by 2x2 averaging kernel

I am trying to downsample a pandas dataframe in order to reduce granularity. In example, I want to reduce this dataframe:

1  2  3  4
2  4  3  3
2  2  1  3
3  1  3  2

to this (downsampling to obtain a 2x2 dataframe using mean):

2.25  3.25
2     2.25

Is there a builtin way or efficient way to do it or I have to write it on my own?

Thanks

like image 920
gc5 Avatar asked Sep 16 '13 10:09

gc5


2 Answers

One option is to use groupby twice. Once for the index:

In [11]: df.groupby(lambda x: x//2).mean()
Out[11]:
     0    1  2    3
0  1.5  3.0  3  3.5
1  2.5  1.5  2  2.5

and once for the columns:

In [12]: df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
Out[12]:
      0     1
0  2.25  3.25
1  2.00  2.25

Note: A solution which only calculated the mean once might be preferable... one option is to stack, groupby, mean, and unstack, but atm this is a little fiddly.

This seems significantly faster than Vicktor's solution:

In [21]: df = pd.DataFrame(np.random.randn(100, 100))

In [22]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
1000 loops, best of 3: 1.64 ms per loop

In [23]: %timeit viktor()
1 loops, best of 3: 822 ms per loop

In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:

In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))

In [32]: %timeit df.groupby(lambda x: x//2).mean().groupby(lambda y: y//2, axis=1).mean()
10 loops, best of 3: 42.9 ms per loop

In [33]: %timeit viktor()
# crashes

As Viktor points out, this doesn't work with non-integer index, if this was wanted, you could just store them as temp variables and feed them back in after:

df_index, df_cols, df.index, df.columns = df.index, df.columns, np.arange(len(df.index)), np.arange(len(df.columns))
res = df.groupby(...
res.index, res.columns = df_index[::2], df_cols[::2]
like image 82
Andy Hayden Avatar answered Nov 03 '22 03:11

Andy Hayden


You can use the rolling_mean function applied twice, first on the columns and then on the rows, and then slice the results:

rbs = 2 # row block size
cbs = 2 # column block size
pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T,
                rbs)[rbs-1::rbs]

Which gives the same result you want, except the index will be different (but you can fix this using .reset_index(drop=True)):

      1     3
1  2.25  3.25
3  2.00  2.25

Timing info:

In [11]: df = pd.DataFrame(np.random.randn(100, 100))
In [12]: %%timeit
         pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
100 loops, best of 3: 4.75 ms per loop
In [13]: %%timeit
         df.groupby(lambda x: x/2).mean().groupby(lambda y: y/2, axis=1).mean()
100 loops, best of 3: 932 µs per loop

So it's around 5x slower than the groupby not 800x :)

like image 43
Viktor Kerkez Avatar answered Nov 03 '22 02:11

Viktor Kerkez