"Pivot" a Pandas DataFrame into a 3D numpy array

Question

Given a DataFrame with the following structure:

Date     | Site  | Measurement Type | Value
-----------------------------------------------
1/1/2020 | A     | Temperature      | 32.3
1/2/2020 | B     | Humidity         | 70%

I would like to create a 3D "pivot table" where the first axis represents site, the second represents date, the third represents measurement type, and values are stored in each element.

For example, if I had daily measurements for one week at 5 sites, measuring both Temperature and Humidity, the desired output would be an array with shape (5, 7, 2).

Pandas only seems to support creating 2D pivot tables, but I'm happy with just an unlabeled 3D numpy array as output. Wondering if there's an existing easy way to do this before I spend time implementing it myself.

Andy L. · Accepted Answer

It is doable using df.pivot_table. I added one more row to your sample to have both Measurement Type. On missing values, it will be represented by np.nan

sample `df`

       Date Site Measurement_Type Value
0  1/1/2020    A      Temperature  32.3
1  1/1/2020    A         Humidity   60%
2  1/2/2020    B         Humidity   70%

Try the followings

iix = pd.MultiIndex.from_product([np.unique(df.Date), np.unique(df.Measurement_Type)])
df_pivot = (df.pivot_table('Value', 'Site', ['Date', 'Measurement_Type'], aggfunc='first')
              .reindex(iix, axis=1))
arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                       .to_numpy().tolist())

print(arr)

Out[1447]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)

Method 2: using pivot_table on different columns and numpy reshape

iix_n = pd.MultiIndex.from_product([np.unique(df.Site), np.unique(df.Date)])
arr = (df.pivot_table('Value', ['Site', 'Date'], 'Measurement_Type', aggfunc='first')
         .reindex(iix_n).to_numpy()
         .reshape(df.Site.nunique(),df.Date.nunique(),-1))

Out[1501]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)

Denver · Answer

I think what you are looking for is something like a panel. You could also just use a 3-dim numpy array. For example using panel:

p_dim = {}

# desired columns
cols = ['Site', 'Measurement Type']

for date in df.Date:
    sub_df = df[df.Date.isin([date])].reset_index(drop=True)
    p_dim[date] = sub_df[[c for c in sub_df.columns if c in cols]]

panel = pd.Panel(p_dim)

Now you can access the various data associated with a date using panel['1/1/2020'] assuming your Date column is type str. To see all available keys you can use panel.keys().

Jean Lescut · Answer

I did a small script to measure perf of the different @Andy L. approaches.

Second method seems slightly faster :

def pivot_table_3d_1(df, col1, col2, col3, value, aggfunc='first') :
    iix = pd.MultiIndex.from_product([np.unique(df[col2]), np.unique(df[col3])])
    df_pivot = (df.pivot_table(value, col1, [col2,col3], aggfunc=aggfunc)
                  .reindex(iix, axis=1))
    arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                           .to_numpy().tolist())

    return arr

def pivot_table_3d_2(df, col1, col2, col3, value, aggfunc='first') :
    iix_n = pd.MultiIndex.from_product([np.unique(df[col1]), np.unique(df[col2])])
    arr = (df.pivot_table(value, [col1, col2], col3, aggfunc=aggfunc)
         .reindex(iix_n).to_numpy()
         .reshape(df[col1].nunique(),df[col2].nunique(),-1))

    return arr

## TESTING
N1 = 100
N2 = 200
N3 = 300
df = pd.DataFrame({'col1': np.random.randint(0, N1, N1*N2*N3),
                   'col2': np.random.randint(0, N2, N1*N2*N3),
                   'col3': np.random.randint(0, N3, N1*N2*N3),
                   'value': np.random.normal(0,1,N1*N2*N3)})

%timeit pivot_table_3d(df, col1='col1', col2='col2', col3='col3', value='value')
# 10.2 s ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_3d_2(df, col1='col1', col2='col2', col3='col3', value='value')
#9.47 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

"Pivot" a Pandas DataFrame into a 3D numpy array

Tags:

python

pandas

numpy

pivot-table

LoLa

3 Answers

Andy L.

Denver

Jean Lescut

Recent Activity

Donate For Us

"Pivot" a Pandas DataFrame into a 3D numpy array

Tags:

python

pandas

numpy

pivot-table

LoLa

3 Answers

Andy L.

Denver

Jean Lescut

Related questions

Recent Activity

Donate For Us