Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question

Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a. an n-tensor)?


Example

Suppose I set up a DataFrame like

from pandas import DataFrame, MultiIndex

index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)

which outputs

     value
0 0      0
  1      1
  2      2
1 1      4
  2      5

The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using

print(frame.unstack().values)

which outputs

[[  0.   1.   2.]
 [ nan   4.   5.]]

How does this generalize to an n-level index?

Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.

I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
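
To make the size mismatch concrete, here is a minimal sketch using the same 2 x 3 setup. The names n_rows, n_cells and reshape_ok are only for illustration.

```python
import numpy as np
from pandas import DataFrame, MultiIndex

index = MultiIndex.from_product([range(2), range(3)])
frame = DataFrame({'value': list(range(6))}, index=index).drop((1, 0))

# 5 rows remain, but the full grid spanned by the levels has 2 * 3 = 6 cells
n_rows = len(frame)
n_cells = int(np.prod([len(level) for level in frame.index.levels]))

try:
    frame.values.reshape(2, 3)
    reshape_ok = True
except ValueError:
    # NumPy refuses to reshape 5 elements into a (2, 3) array
    reshape_ok = False
```

So any reshape-based approach first has to fill in the missing index combinations.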

Any suggestions are highly appreciated.

Igor Raush asked Jan 27 '16 20:01


1 Answer

Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

import numpy as np

# create an array of NaN with one axis per index level
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)

# fill it using Numpy's advanced indexing (the index must be a tuple of arrays)
arr[tuple(frame.index.codes)] = frame.values.ravel()
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.ravel()
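
For reuse, the same idea can be wrapped in a small function. This is only a sketch: to_ndarray is a hypothetical helper name, it assumes Pandas >= 0.24 (for codes and to_numpy), a single column of numeric values, and an index with no duplicate or missing entries.

```python
import numpy as np
from pandas import DataFrame, MultiIndex


def to_ndarray(series):
    """Scatter a MultiIndex-ed Series into a dense n-D float array (hypothetical helper)."""
    index = series.index
    # one axis per index level, sized by the number of values in that level
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, np.nan)
    # codes holds, per level, the integer position of every row along that axis
    arr[tuple(index.codes)] = series.to_numpy()
    return arr


# the 3-D setup with one row dropped
index = MultiIndex.from_product([range(2), range(2), range(2)])
frame = DataFrame({'value': list(range(8))}, index=index).drop((1, 0, 1))
arr = to_ndarray(frame['value'])
```

Here arr has shape (2, 2, 2) and the dropped position (1, 0, 1) stays NaN.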

Original solution. Given a setup similar to above, but in 3-D,

from pandas import DataFrame, MultiIndex
from itertools import product

index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)

we have

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and memory-hungry, depending on the number of dimensions and on the size of the data frame.

levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)

which outputs

       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7

Now, reshape() will work as intended.

shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))

which outputs

[[[  0.   1.]
  [  2.   3.]]

 [[  4.  nan]
  [  6.   7.]]]

The (rather ugly) one-liner is

frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(tuple(map(len, frame.index.levels)))
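
A possibly more readable variant of the reindex route builds the full index with MultiIndex.from_product instead of itertools.product. This is a sketch under the assumption that the levels are in their default sorted order; the names full_index and dense are just for illustration.

```python
from pandas import DataFrame, MultiIndex

# the 3-D setup with one row dropped
index = MultiIndex.from_product([range(2), range(2), range(2)])
frame = DataFrame({'value': list(range(8))}, index=index).drop((1, 0, 1))

# full cartesian product of the levels, in the same order reshape() expects
full_index = MultiIndex.from_product(frame.index.levels)
shape = tuple(len(level) for level in frame.index.levels)

# reindex inserts NaN for the missing combination, then reshape to n-D
dense = frame['value'].reindex(full_index).to_numpy().reshape(shape)
```

It has the same memory cost as the reindex step above, since the full index is still materialized.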
Igor Raush answered Nov 15 '22 15:11