Why does pandas rolling use a single-dimension ndarray?

Tags: python, pandas, group-by, numpy, pandas-groupby

I was motivated to use the pandas rolling feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected to be able to call apply after df.rolling(2), take the resulting pd.DataFrame, extract the ndarray with .values, and perform the requisite matrix multiplication. It didn't work out that way.

Here is what I found:

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)

Here is what the objects look like:

print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape

df = 
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

X = 
[[ 0.93]
 [ 0.83]]

df.shape = (5, 2) , X.shape = (2L, 1L)

Matrix multiplication behaves normally:

df.values.dot(X)

array([[ 0.7495],
       [ 0.8179],
       [ 0.4444],
       [ 1.4711],
       [ 1.3562]])

Using apply to perform row by row dot product behaves as expected:

df.apply(lambda x: x.values.dot(X)[0], axis=1)

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

Groupby -> Apply behaves as I'd expect:

df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])

0    0.7495
1    0.8179
2    0.4444
3    1.4711
4    1.3562
dtype: float64

But when I run:

df.rolling(1).apply(lambda x: x.values.dot(X))

I get:

AttributeError: 'numpy.ndarray' object has no attribute 'values'

OK, so pandas is passing a plain ndarray inside its rolling implementation. I can handle that. Instead of using .values to get the ndarray, let's try:

df.rolling(1).apply(lambda x: x.dot(X))

ValueError: shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)

Wait! What?!

So I created a custom function to look at what rolling is doing.

def print_type_sum(x):
    print type(x), x.shape
    return x.sum()

Then ran:

print df.rolling(1).apply(print_type_sum)

<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

My resulting pd.DataFrame is the same, which is good. But it printed out 10 single-dimensional ndarray objects. What about rolling(2)?

print df.rolling(2).apply(print_type_sum)

<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
      A     B
0   NaN   NaN
1  0.90  0.88
2  0.92  0.49
3  1.31  0.84
4  1.63  1.58

Same thing: the expected output, but it printed 8 ndarray objects. rolling is producing a single-dimensional ndarray of length window for each column, as opposed to what I expected, which was an ndarray of shape (window, len(df.columns)).
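The shape mismatch can be reproduced without pandas. This is a minimal sketch; column_window and row_window are illustrative names for the per-column array that rolling(1) handed to the function and for the per-row array I expected, and the numbers are copied from the first row of df above.

```python
import numpy as np

X = np.array([[0.93], [0.83]])           # shape (2, 1), as in the question

# What rolling(1).apply passed in: one column's window, shape (1,).
column_window = np.array([0.44])

# What was expected instead: all columns for the window, shape (1, 2).
row_window = np.array([[0.44, 0.41]])

error = None
try:
    column_window.dot(X)                  # (1,) against (2, 1) cannot align
except ValueError as exc:
    error = exc

product = row_window.dot(X)               # (1, 2) against (2, 1) gives (1, 1)
print(error)
print(product)
```

Only the two-dimensional window lines up with X, which is why the per-column arrays cannot be used for the matrix multiplication.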

The question is: why?

I now don't have a way to easily run a rolling multi-factor regression.
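For comparison, the shape I was hoping for can be produced with a plain Python loop over row slices. This is only a sketch under stated assumptions: rolling_frames is a hypothetical helper name, not a pandas API, and the frame is rebuilt from the printed values above rather than from the random seed.

```python
import numpy as np
import pandas as pd

def rolling_frames(df, window):
    """Yield (label, block) pairs, one per full window of rows.

    Each block is a 2-D ndarray of shape (window, number of columns),
    labelled by the index label of the window's last row.
    """
    values = df.values
    for end in range(window, len(df) + 1):
        yield df.index[end - 1], values[end - window:end]

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
X = np.array([[0.93], [0.83]])

# One output row per window, one output column per row inside the window.
result = pd.DataFrame(
    {label: block.dot(X).ravel() for label, block in rolling_frames(df, 2)}
).T
print(result)
```

This does not explain why rolling works column by column; it only shows that the two-dimensional windows are easy to build by hand, at the cost of a Python-level loop.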


asked May 27 '16 14:05

piRSquared

2 Answers

I wanted to share what I've done to work around this problem.

Given a pd.DataFrame and a window, I generate a stacked ndarray using np.dstack (see answer). I then convert it to a pd.Panel and, using pd.Panel.to_frame, convert it to a pd.DataFrame. At this point, I have a pd.DataFrame that has an additional level on its index relative to the original pd.DataFrame, and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain [0, 1, 2], one item for each period. I can now groupby level=0 and return the groupby object. This gives me an object that I can manipulate much more intuitively.

Roll Function

import pandas as pd
import numpy as np

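def roll(df, w):
    # Stack every w-row window along a third axis, then transpose to
    # shape (number of windows, number of columns, w).
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # Label each window with the index label of its last row.
    panel = pd.Panel(roll_array,
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # Flatten to a DataFrame and group the rows by window label.
    return panel.to_frame().unstack().T.groupby(level=0)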
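pd.Panel was removed from later pandas releases, so the function above will not run there. The following is a sketch of one possible Panel-free equivalent, not the code used for the output in this answer. roll_concat is a hypothetical name, and the approach (pd.concat over a dict of row slices) is swapped in for the np.dstack and pd.Panel steps.

```python
import numpy as np
import pandas as pd

def roll_concat(df, w):
    # Map the index label of each window's last row to that window's rows,
    # renumbering the rows inside every window as 0..w-1.
    windows = {
        df.index[i + w - 1]: df.iloc[i:i + w].reset_index(drop=True)
        for i in range(len(df) - w + 1)
    }
    # Dict keys become the outer index level, row positions the inner one.
    stacked = pd.concat(windows)
    stacked.index.names = ['major', 'roll']
    return stacked.groupby(level='major')

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=['A', 'B'],
)
rolled = roll_concat(df, 2)
sums = rolled.sum()
print(sums)
```

The returned groupby object is meant to be used the same way as the one from roll, for example with .sum() or .apply.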

Demonstration

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

print df

      A     B
0  0.44  0.41
1  0.46  0.47
2  0.46  0.02
3  0.85  0.82
4  0.78  0.76

Let's sum

rolled_df = roll(df, 2)

print rolled_df.sum()

major     A     B
1      0.90  0.88
2      0.92  0.49
3      1.31  0.84
4      1.63  1.58

To peek under the hood, we can see the structure:

print rolled_df.apply(lambda x: x)

major      A     B
  roll            
1 0     0.44  0.41
  1     0.46  0.47
2 0     0.46  0.47
  1     0.46  0.02
3 0     0.46  0.02
  1     0.85  0.82
4 0     0.85  0.82
  1     0.78  0.76

But what about the purpose for which I built this, rolling multi-factor regression? I'll settle for matrix multiplication for now.

X = np.array([2, 3])

print rolled_df.apply(lambda df: pd.Series(df.values.dot(X))) 

      0     1
1  2.11  2.33
2  2.33  0.98
3  0.98  4.16
4  4.16  3.84


answered Oct 16 '22 22:10

piRSquared

Using the concept of strided views on a dataframe, here's a vectorized approach -

get_sliding_window(df, 2).dot(X) # window size = 2
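get_sliding_window is not defined anywhere in this thread. The following is a minimal sketch of what such a helper could look like, assuming it returns a read-only strided view of shape (number of windows, W, number of columns); the body is a reconstruction for illustration, not necessarily the original definition.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def get_sliding_window(df, W):
    # Accepts a DataFrame or any 2-D array-like; works on the raw values.
    a = np.asarray(df, dtype=float)
    m, n = a.shape
    s0, s1 = a.strides
    # Reuse the row stride for the new leading axis so that window i
    # starts at row i. No data is copied; the result is a view.
    return as_strided(a, shape=(m - W + 1, W, n),
                      strides=(s0, s0, s1), writeable=False)

values = np.array([[0.44, 0.41], [0.46, 0.47], [0.46, 0.02],
                   [0.85, 0.82], [0.78, 0.76]])
X = np.array([2, 3])

windows = get_sliding_window(values, 2)   # shape (4, 2, 2)
out = windows.dot(X)                      # shape (4, 2)
print(out)
```

With the values printed in the question, this reproduces the 4 x 2 table from the other answer (2.11, 2.33 in the first row). Because the view shares memory with the input, it is kept read-only here.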

Runtime test -

In [101]: df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])

In [102]: X = np.array([2, 3])

In [103]: rolled_df = roll(df, 2)

In [104]: %timeit rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
100 loops, best of 3: 5.51 ms per loop

In [105]: %timeit get_sliding_window(df, 2).dot(X)
10000 loops, best of 3: 43.7 µs per loop

Verify results -

In [106]: rolled_df.apply(lambda df: pd.Series(df.values.dot(X)))
Out[106]: 
      0     1
1  2.70  4.09
2  4.09  2.52
3  2.52  1.78
4  1.78  3.50

In [107]: get_sliding_window(df, 2).dot(X)
Out[107]: 
array([[ 2.7 ,  4.09],
       [ 4.09,  2.52],
       [ 2.52,  1.78],
       [ 1.78,  3.5 ]])

That is a huge improvement, which I hope will remain noticeable on larger arrays!

answered Oct 16 '22 21:10

Divakar
