Pandas Dataframe or Panel to 3d numpy array

Question

Setup:

pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])

output:

                         c           d           e
a           b           
0.439502    0.115087     0.832546    0.760513    0.776555
            0.609107     0.247642    0.031650    0.727773
0.995370    0.299640     0.053523    0.565753    0.857235
            0.392132     0.832560    0.774653    0.213692

Each data series is grouped by the index ID a and b represents a time index for the other features of a. Is there a way to get the pandas to produce a numpy 3d array that reflects the a groupings? Currently it reads the data as two dimensional so pdf.shape outputs (4, 5). What I would like is for the array to be of the variable form:

array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],

       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])

Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

Leo · Accepted Answer

I just had an extremely similar problem and solved it like this:

a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

output:

array([[[ 0.47780308,  0.93422319,  0.00526572,  0.41645868,  0.82089215],
    [ 0.47780308,  0.15372096,  0.20948369,  0.76354447,  0.27743855]],

   [[ 0.75146799,  0.39133973,  0.25182206,  0.78088926,  0.30276705],
    [ 0.75146799,  0.42182369,  0.01166461,  0.00936464,  0.53208731]]])

verifying it is 3d, a3d.shape gives (2, 2, 5).

Lastly, to make the newly created dimension the last dimension (instead of the first) then use:

a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)))

which has a shape of (2, 5, 2)

For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.

Example setup with ragged shape:

pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]
pdf.set_index(['a','b'])

dataframe:

                        c           d           e
a           b           
0.460013    0.577535    0.299304    0.617103    0.378887
            0.167907    0.244972    0.615077    0.311497
0.318823    0.640575    0.768187    0.652760    0.822311
            0.424744    0.958405    0.659617    0.998765
            0.077048    0.407182    0.758903    0.273737

One possible solution:

n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.as_matrix)
                    .apply(lambda x: np.pad(x, ((0, n_max-len(x)), (0, 0)), 'constant'))))

a3d.shape gives (2, 3, 5)

Jeff · Answer

panel.values

will return a numpy array directly. this will by necessity be the highest acceptable dtype as everything is smushed into a single 3-d numpy array. It will be new array and not a view of the pandas data (no matter the dtype).

Mithril · Answer

as_matrix is deprecated, and here we assume first key is a , then groups in a may have different length, this method solve all the problem .

import pandas as pd
import numpy as np
from typing import List

def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a Dataframe

    Args:
        df: Dataframe
        idx_cols: columns defining the dimensions of the cube

    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'

    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    idx_dims = [len(level) + 1 for level in df.index.levels]
    idx_dims.append(len(df.columns))

    cube = np.empty(idx_dims)
    cube.fill(np.nan)
    cube[tuple(np.array(df.index.to_list()).T)] = df.values

    return cube

Test:


pdf = pd.DataFrame(np.random.rand(4,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]

# a, b must be integer 
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)

make_cube(pdf1, ['a', 'b']).shape

give : (2, 2, 3)


pdf = pd.DataFrame(np.random.rand(5,5), columns = list('abcde'))
pdf['a'][2:]=pdf['a'][0]
pdf['a'][:2]=pdf['a'][1]

pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
.assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
)

make_cube(pdf1, ['a', 'b']).shape

give s (2, 3, 3) .

Pandas Dataframe or Panel to 3d numpy array

Tags:

python

pandas

numpy

o1lo01ol1o

3 Answers

Leo

Jeff

Mithril

Recent Activity

Donate For Us

Pandas Dataframe or Panel to 3d numpy array

Tags:

python

pandas

numpy

o1lo01ol1o

3 Answers

Leo

Jeff

Mithril

Related questions

Recent Activity

Donate For Us