Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unlike Numpy, Pandas doesn't seem to like memory strides

Pandas seems to be missing a R-style matrix-level rolling window function (rollapply(..., by.column = FALSE)), providing only the vector based version. Thus I tried to follow this question and it works beautifully with the example which can be replicated, but it doesn't work with pandas DataFrames even when using the (seemingly identical) underlying Numpy array.

Artificial problem replication:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype = np.int64)
pp = pd.DataFrame(test).values

mm and pp look identical:

enter image description here

The numpy directly-derived matrix gives me what I want perfectly:

as_strided(mm, (mm.shape[0] - 3 + 1, 3, mm.shape[1]), (mm.shape[1] * 8, mm.shape[1] * 8, 8))

That is, it gives me 3 strides of 3 rows each, in a 3d matrix, allowing me to perform computations on a submatrix moving down by one row at a time.

enter image description here

But the pandas-derived version (identical call with mm replaced by pp):

 as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (pp.shape[1] * 8, pp.shape[1] * 8, 8))

enter image description here

is all weird like it's transposed somehow. Is this to do with column/row major order stuff?

I need to do matrix sliding windows in Pandas, and this seems my best shot, especially because it is really fast. What's going on here? How do I get the underlying Pandas array to behave like Numpy?

like image 971
Thomas Browne Avatar asked Apr 16 '15 11:04

Thomas Browne


1 Answers

It seems that the .values returns the underlying data in Fortran order (as you speculated):

>>> mm.flags # NumPy array
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  ...

>>> pp.flags # array from DataFrame
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  ...

This confuses as_strided which expects the data to be arranged in C order in memory.

To fix things, you could copy the data in C order and use the same strides as in your question:

pp = pp.copy('C') 

Alternatively, if you want to avoid copying large amounts of data, adjust the strides to acknowledge the column-order layout of the data:

as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (8, 8, pp.shape[0]*8))
like image 110
Alex Riley Avatar answered Nov 15 '22 20:11

Alex Riley