
Pandas Dataframe or Panel to 3d numpy array

Setup:

import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]
pdf.set_index(['a', 'b'])

output:

                         c           d           e
a           b           
0.439502    0.115087     0.832546    0.760513    0.776555
            0.609107     0.247642    0.031650    0.727773
0.995370    0.299640     0.053523    0.565753    0.857235
            0.392132     0.832560    0.774653    0.213692

Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a numpy 3d array that reflects the a groupings? Currently it reads the data as two-dimensional, so pdf.shape outputs (4, 5). What I would like is for the array to be of the variable form:

array([[[-1.38655912, -0.90145951, -0.95106951,  0.76570984],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576],
        [-0.21004144, -2.66498267, -0.29255182,  1.43411576]],

       [[ 0.0768149 , -0.7566995 , -2.57770951,  0.70834656],
        [-0.99097395, -0.81592084, -1.21075386,  0.12361382]]])

Is there a native Pandas way to do this? Note that number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?

asked May 05 '14 by o1lo01ol1o

3 Answers

I just had an extremely similar problem and solved it like this:

a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)))

output:

array([[[ 0.47780308,  0.93422319,  0.00526572,  0.41645868,  0.82089215],
    [ 0.47780308,  0.15372096,  0.20948369,  0.76354447,  0.27743855]],

   [[ 0.75146799,  0.39133973,  0.25182206,  0.78088926,  0.30276705],
    [ 0.75146799,  0.42182369,  0.01166461,  0.00936464,  0.53208731]]])

Verifying that it is 3d: a3d.shape gives (2, 2, 5).

Lastly, to make the newly created dimension the last one (instead of the first), use:

a3d = np.dstack(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)))

which has a shape of (2, 5, 2)
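For reference, a self-contained, reproducible sketch of the same approach (the fixed seed and the list comprehension are my additions; `as_matrix` has been removed from modern pandas, so `to_numpy()` is used instead):

```python
import numpy as np
import pandas as pd

# Reproducible variant of the setup above.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((4, 5)), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]

# Stack the per-group matrices along a new leading axis.
a3d = np.array([g.to_numpy() for _, g in pdf.groupby('a')])
assert a3d.shape == (2, 2, 5)

# Same data with the group axis last.
a3d_last = np.dstack([g.to_numpy() for _, g in pdf.groupby('a')])
assert a3d_last.shape == (2, 5, 2)
```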


For cases where the data is ragged (as brought up by CharlesG in the comments), you can use something like the following if you want to stick to a numpy solution. Be aware, though, that the best strategy for dealing with missing data varies from case to case. In this example we simply add zeros for the missing rows.

Example setup with ragged shape:

pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]
pdf.set_index(['a', 'b'])

dataframe:

                        c           d           e
a           b           
0.460013    0.577535    0.299304    0.617103    0.378887
            0.167907    0.244972    0.615077    0.311497
0.318823    0.640575    0.768187    0.652760    0.822311
            0.424744    0.958405    0.659617    0.998765
            0.077048    0.407182    0.758903    0.273737

One possible solution:

n_max = pdf.groupby('a').size().max()
a3d = np.array(list(pdf.groupby('a').apply(pd.DataFrame.to_numpy)
                    .apply(lambda x: np.pad(x, ((0, n_max - len(x)), (0, 0)), 'constant'))))

a3d.shape gives (2, 3, 5)
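A variant sketch of the same padding idea (my changes: NaN padding via a list comprehension instead of apply, so real zeros in the data stay distinguishable from the padding rows):

```python
import numpy as np
import pandas as pd

# Ragged setup: one group ends up with 2 rows, the other with 3.
rng = np.random.default_rng(1)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]

groups = [g.to_numpy() for _, g in pdf.groupby('a')]
n_max = max(len(g) for g in groups)

# Pad the short groups with NaN rows so every slice has n_max rows.
a3d = np.array([
    np.pad(g, ((0, n_max - len(g)), (0, 0)), 'constant', constant_values=np.nan)
    for g in groups
])
assert a3d.shape == (2, 3, 5)
assert np.isnan(a3d).sum() == 5  # exactly one padded row of 5 columns
```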

answered Oct 03 '22 by Leo


panel.values

will return a numpy array directly. By necessity this will use the highest common dtype, as everything is smushed into a single 3-d numpy array. It will be a new array, not a view of the pandas data (no matter the dtype). Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so this only applies to older versions.
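Since pd.Panel is gone from modern pandas, here is a minimal sketch of one way to get a comparable 3-d array from a MultiIndexed DataFrame, assuming every a group has the same number of rows (otherwise see the padding approaches above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
pdf = pd.DataFrame(rng.random((4, 5)), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]

# Sort so rows of each 'a' group are contiguous, then reshape;
# like Panel.values, this produces a new array, not a view.
indexed = pdf.set_index(['a', 'b']).sort_index()
n_groups = indexed.index.get_level_values('a').nunique()
a3d = indexed.to_numpy().reshape(n_groups, -1, indexed.shape[1])
assert a3d.shape == (2, 2, 3)
```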

answered Oct 03 '22 by Jeff


as_matrix is deprecated (and has since been removed; use to_numpy instead). Also, here we assume the first key is a, and groups in a may have different lengths; the method below handles both problems.

import pandas as pd
import numpy as np
from typing import List

def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a Dataframe

    Args:
        df: Dataframe
        idx_cols: columns defining the dimensions of the cube

    Returns:
        multi-dimensional array
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'

    df = df.set_index(keys=idx_cols)  # don't overwrite a parameter, thus copy!
    idx_dims = [int(level.max()) + 1 for level in df.index.levels]  # levels hold 0-based integer codes
    idx_dims.append(len(df.columns))

    cube = np.empty(idx_dims)
    cube.fill(np.nan)
    cube[tuple(np.array(df.index.to_list()).T)] = df.values

    return cube

Test:


pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]

# a, b must be integer codes
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 2, 3).


pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf['a'].iloc[0]
pdf.loc[:1, 'a'] = pdf['a'].iloc[1]

pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount()))

make_cube(pdf1, ['a', 'b']).shape

gives (2, 3, 3).
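Because the cube is NaN-filled wherever an (a, b) combination is missing, downstream reductions should use numpy's NaN-aware functions. A small sketch with a hand-built cube of the same (2, 3, 3) shape (the constant values are illustrative, not from the data above):

```python
import numpy as np

# A cube like the second test produces: group 0 has 2 time steps,
# group 1 has 3, and the missing row is NaN-padded.
cube = np.full((2, 3, 3), np.nan)
cube[0, :2] = 1.0
cube[1, :3] = 2.0

# NaN-aware reductions skip the padding rows.
means = np.nanmean(cube, axis=1)
assert means.shape == (2, 3)
assert np.allclose(means, [[1.0] * 3, [2.0] * 3])
```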

answered Oct 03 '22 by Mithril