Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

"Pivot" a Pandas DataFrame into a 3D numpy array

Given a DataFrame with the following structure:

Date     | Site  | Measurement Type | Value
-----------------------------------------------
1/1/2020 | A     | Temperature      | 32.3
1/2/2020 | B     | Humidity         | 70%

I would like to create a 3D "pivot table" where the first axis represents site, the second represents date, the third represents measurement type, and values are stored in each element.

For example, if I had daily measurements for one week at 5 sites, measuring both Temperature and Humidity, the desired output would be an array with shape (5, 7, 2).

Pandas only seems to support creating 2D pivot tables, but I'm happy with just an unlabeled 3D numpy array as output. Wondering if there's an existing easy way to do this before I spend time implementing it myself.

like image 539
LoLa Avatar asked Dec 31 '19 16:12

LoLa


3 Answers

It is doable using df.pivot_table. I added one more row to your sample to have both Measurement Type. On missing values, it will be represented by np.nan

sample `df`

       Date Site Measurement_Type Value
0  1/1/2020    A      Temperature  32.3
1  1/1/2020    A         Humidity   60%
2  1/2/2020    B         Humidity   70%

Try the followings

iix = pd.MultiIndex.from_product([np.unique(df.Date), np.unique(df.Measurement_Type)])
df_pivot = (df.pivot_table('Value', 'Site', ['Date', 'Measurement_Type'], aggfunc='first')
              .reindex(iix, axis=1))
arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                       .to_numpy().tolist())

print(arr)

Out[1447]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)

Method 2: using pivot_table on different columns and numpy reshape

iix_n = pd.MultiIndex.from_product([np.unique(df.Site), np.unique(df.Date)])
arr = (df.pivot_table('Value', ['Site', 'Date'], 'Measurement_Type', aggfunc='first')
         .reindex(iix_n).to_numpy()
         .reshape(df.Site.nunique(),df.Date.nunique(),-1))

Out[1501]:
array([[['60%', '32.3'],
        [nan, nan]],

       [[nan, nan],
        ['70%', nan]]], dtype=object)
like image 88
Andy L. Avatar answered Oct 16 '22 20:10

Andy L.


I think what you are looking for is something like a panel. You could also just use a 3-dim numpy array. For example using panel:

p_dim = {}

# desired columns
cols = ['Site', 'Measurement Type']

for date in df.Date:
    sub_df = df[df.Date.isin([date])].reset_index(drop=True)
    p_dim[date] = sub_df[[c for c in sub_df.columns if c in cols]]

panel = pd.Panel(p_dim)

Now you can access the various data associated with a date using panel['1/1/2020'] assuming your Date column is type str. To see all available keys you can use panel.keys().

like image 39
Denver Avatar answered Oct 16 '22 22:10

Denver


I did a small script to measure perf of the different @Andy L. approaches.

Second method seems slightly faster :

def pivot_table_3d_1(df, col1, col2, col3, value, aggfunc='first') :
    iix = pd.MultiIndex.from_product([np.unique(df[col2]), np.unique(df[col3])])
    df_pivot = (df.pivot_table(value, col1, [col2,col3], aggfunc=aggfunc)
                  .reindex(iix, axis=1))
    arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
                           .to_numpy().tolist())

    return arr

def pivot_table_3d_2(df, col1, col2, col3, value, aggfunc='first') :
    iix_n = pd.MultiIndex.from_product([np.unique(df[col1]), np.unique(df[col2])])
    arr = (df.pivot_table(value, [col1, col2], col3, aggfunc=aggfunc)
         .reindex(iix_n).to_numpy()
         .reshape(df[col1].nunique(),df[col2].nunique(),-1))

    return arr

## TESTING
N1 = 100
N2 = 200
N3 = 300
df = pd.DataFrame({'col1': np.random.randint(0, N1, N1*N2*N3),
                   'col2': np.random.randint(0, N2, N1*N2*N3),
                   'col3': np.random.randint(0, N3, N1*N2*N3),
                   'value': np.random.normal(0,1,N1*N2*N3)})

%timeit pivot_table_3d(df, col1='col1', col2='col2', col3='col3', value='value')
# 10.2 s ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_3d_2(df, col1='col1', col2='col2', col3='col3', value='value')
#9.47 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 39
Jean Lescut Avatar answered Oct 16 '22 20:10

Jean Lescut