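Setup:
import numpy as np
import pandas as pd
pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
# Rows 2-3 share one value of a and rows 0-1 another; .loc avoids unreliable chained assignment.
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])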
Output:
                          c         d         e
a        b
0.439502 0.115087  0.832546  0.760513  0.776555
         0.609107  0.247642  0.031650  0.727773
0.995370 0.299640  0.053523  0.565753  0.857235
         0.392132  0.832560  0.774653  0.213692
Each data series is grouped by the index ID a, and b represents a time index for the other features of a. Is there a way to get pandas to produce a NumPy 3D array that reflects the a groupings? Currently the data is read as two-dimensional, so pdf.shape outputs (4, 5). What I would like is an array with a variable number of rows per group, of the form:
array([[[-1.38655912, -0.90145951, -0.95106951, 0.76570984],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576],
[-0.21004144, -2.66498267, -0.29255182, 1.43411576]],
[[ 0.0768149 , -0.7566995 , -2.57770951, 0.70834656],
[-0.99097395, -0.81592084, -1.21075386, 0.12361382]]])
Is there a native pandas way to do this? Note that the number of rows per a grouping in the actual data is variable, so I cannot just transpose or reshape pdf.values. If there isn't a native way, what's the best method for iteratively constructing the arrays from hundreds of thousands of rows and hundreds of columns?
I just had an extremely similar problem and solved it like this:
# as_matrix was removed in pandas 1.0; to_numpy is its replacement.
a3d = np.array([g.to_numpy() for _, g in pdf.groupby('a')])
output:
array([[[ 0.47780308, 0.93422319, 0.00526572, 0.41645868, 0.82089215],
[ 0.47780308, 0.15372096, 0.20948369, 0.76354447, 0.27743855]],
[[ 0.75146799, 0.39133973, 0.25182206, 0.78088926, 0.30276705],
[ 0.75146799, 0.42182369, 0.01166461, 0.00936464, 0.53208731]]])
To verify that it is 3D, a3d.shape gives (2, 2, 5).
Lastly, to make the newly created dimension the last dimension (instead of the first), use:
a3d = np.dstack([g.to_numpy() for _, g in pdf.groupby('a')])
which has a shape of (2, 5, 2).
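The two layouts above hold the same numbers and differ only in axis order. Here is a minimal sketch of that relationship, assuming equally sized groups. The fixed group keys 0.1 and 0.2 and the seeded generator are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data: two groups of two rows each, keyed by column a.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((4, 5)), columns=list("abcde"))
pdf.loc[[0, 1], "a"] = 0.1
pdf.loc[[2, 3], "a"] = 0.2

# One 2-D block per group, in sorted key order.
blocks = [g.to_numpy() for _, g in pdf.groupby("a")]

first = np.stack(blocks, axis=0)  # shape (groups, rows, columns)
last = np.dstack(blocks)          # shape (rows, columns, groups)

# Moving the group axis to the end turns one layout into the other.
same = np.array_equal(np.moveaxis(first, 0, -1), last)
print(first.shape, last.shape, same)
```

np.moveaxis returns a view, so switching between the two layouts after the fact does not copy the data.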
For cases where the data is ragged (as brought up by CharlesG in the comments) you can use something like the following if you want to stick to a numpy solution. But be aware that the best strategy to deal with missing data varies from case to case. In this example we simply add zeros for the missing rows.
Example setup with ragged shape:
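pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
# Rows 2-4 form one group and rows 0-1 another, so the groups have unequal lengths.
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf.set_index(['a', 'b'])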
Dataframe:
                          c         d         e
a        b
0.460013 0.577535  0.299304  0.617103  0.378887
         0.167907  0.244972  0.615077  0.311497
0.318823 0.640575  0.768187  0.652760  0.822311
         0.424744  0.958405  0.659617  0.998765
         0.077048  0.407182  0.758903  0.273737
One possible solution:
n_max = pdf.groupby('a').size().max()
# Pad every group with zero rows at the bottom until it has n_max rows.
a3d = np.array([np.pad(g.to_numpy(), ((0, n_max - len(g)), (0, 0)), 'constant')
                for _, g in pdf.groupby('a')])
a3d.shape gives (2, 3, 5)
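Since zero padding cannot be told apart from real zeros in the data, a variant that pads with NaN and derives a validity mask may be useful. The following is a sketch under that assumption. The helper name pad_groups, its fill_value parameter, and the fixed group keys are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def pad_groups(df, key, fill_value=np.nan):
    """Stack the groups of df[key] into one array, padding short groups with fill_value."""
    blocks = [g.to_numpy(dtype=float) for _, g in df.groupby(key)]
    n_max = max(len(block) for block in blocks)
    # Preallocate the full result once, then copy each group into its slot.
    out = np.full((len(blocks), n_max, df.shape[1]), fill_value, dtype=float)
    for i, block in enumerate(blocks):
        out[i, :len(block)] = block
    return out

# Illustrative ragged data: one group of two rows, one group of three rows.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))
pdf.loc[[0, 1], "a"] = 0.1
pdf.loc[[2, 3, 4], "a"] = 0.2

a3d = pad_groups(pdf, "a")
# True where a row holds real data, False where it is padding.
valid = ~np.isnan(a3d).all(axis=2)
print(a3d.shape)
print(valid)
```

Preallocating with np.full and copying per group keeps peak memory close to the size of the final array, which may matter for frames with hundreds of thousands of rows.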
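If the data is held in a pandas Panel, panel.values returns a NumPy array directly. The array necessarily has a single dtype wide enough to hold every value, since everything is packed into one 3-D array. It is a new array, not a view of the pandas data (regardless of dtype). Note that Panel was removed in pandas 0.25, so this applies only to older versions.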
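Panel is no longer available in current pandas (it was removed in 0.25), so a comparable dense 3-D block has to be built another way. One possibility, sketched here and not equivalent to the Panel API, is to reindex a MultiIndex frame against the full product of its levels and reshape the result. The column name pos and the fixed group keys are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Illustrative ragged data: one group of two rows, one group of three rows.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))
pdf.loc[[0, 1], "a"] = 0.1
pdf.loc[[2, 3, 4], "a"] = 0.2

# Position of each row inside its group, used as the second index level.
pdf["pos"] = pdf.groupby("a").cumcount()
indexed = pdf.set_index(["a", "pos"])

# Every (group, position) combination, including the ones missing from the data.
levels = indexed.index.levels
full_index = pd.MultiIndex.from_product(levels, names=indexed.index.names)

# Missing combinations become NaN rows, so the frame is now rectangular.
dense = indexed.reindex(full_index)
cube = dense.to_numpy().reshape(len(levels[0]), len(levels[1]), dense.shape[1])
print(cube.shape)
```

The remaining columns are b, c, d and e, so the last axis has length 4, and the missing third row of the shorter group is filled with NaN.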
as_matrix is deprecated (and was removed in pandas 1.0). Here we assume that the first key is a and that the groups in a may have different lengths. The following method handles both issues.
import pandas as pd
import numpy as np
from typing import List

def make_cube(df: pd.DataFrame, idx_cols: List[str]) -> np.ndarray:
    """Make an array cube from a DataFrame.

    Args:
        df: DataFrame whose idx_cols hold non-negative integer positions
        idx_cols: columns defining the dimensions of the cube (at least two)

    Returns:
        multi-dimensional array, NaN where no row exists
    """
    assert len(set(idx_cols) & set(df.columns)) == len(idx_cols), 'idx_cols must be subset of columns'
    df = df.set_index(keys=idx_cols)  # set_index returns a new frame, so the caller's data is untouched
    # Each axis must be long enough to hold the largest index value on that level.
    idx_dims = [int(level.max()) + 1 for level in df.index.levels]
    idx_dims.append(len(df.columns))
    cube = np.full(idx_dims, np.nan)
    # Scatter each row into the cell addressed by its index tuple.
    cube[tuple(np.array(df.index.to_list()).T)] = df.to_numpy()
    return cube
Test:
pdf = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
# a and b must be integers: replace them with the group number and the position within the group
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
        )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 2, 3).
With ragged groups:
pdf = pd.DataFrame(np.random.rand(5, 5), columns=list('abcde'))
pdf.loc[2:, 'a'] = pdf.loc[0, 'a']
pdf.loc[:1, 'a'] = pdf.loc[1, 'a']
pdf1 = (pdf.assign(a=lambda df: df.groupby(['a']).ngroup())
           .assign(b=lambda df: df.groupby(['a'])['b'].cumcount())
        )
make_cube(pdf1, ['a', 'b']).shape
gives (2, 3, 3).
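When there is only one grouping key, the same scatter-assignment idea can be written without building a MultiIndex. The sketch below assumes that the group number and the within-group position are enough to address every row. The names value_cols, group_idx and row_idx, and the fixed group keys, are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative ragged data: one group of two rows, one group of three rows.
rng = np.random.default_rng(0)
pdf = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))
pdf.loc[[0, 1], "a"] = 0.1
pdf.loc[[2, 3, 4], "a"] = 0.2

value_cols = ["b", "c", "d", "e"]
grouped = pdf.groupby("a")
group_idx = grouped.ngroup().to_numpy()   # which group each row belongs to
row_idx = grouped.cumcount().to_numpy()   # position of each row inside its group

# NaN-filled cube sized by the number of groups and the longest group.
cube = np.full((group_idx.max() + 1, row_idx.max() + 1, len(value_cols)), np.nan)
# A single vectorised assignment places every row in its (group, position) cell.
cube[group_idx, row_idx] = pdf[value_cols].to_numpy()
print(cube.shape)
```

Because the assignment is a single fancy-indexing operation, there is no Python-level loop over groups, which may help with very large frames.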