Given a DataFrame with the following structure:
Date | Site | Measurement Type | Value
-----------------------------------------------
1/1/2020 | A | Temperature | 32.3
1/2/2020 | B | Humidity | 70%
I would like to create a 3D "pivot table" where the first axis represents site, the second represents date, the third represents measurement type, and values are stored in each element.
For example, if I had daily measurements for one week at 5 sites, measuring both Temperature and Humidity, the desired output would be an array with shape (5, 7, 2).
Pandas only seems to support creating 2D pivot tables, but I'm happy with just an unlabeled 3D numpy array as output. Wondering if there's an existing easy way to do this before I spend time implementing it myself.
It is doable using df.pivot_table
. I added one more row to your sample to have both Measurement Type
. On missing values, it will be represented by np.nan
sample `df`
Date Site Measurement_Type Value
0 1/1/2020 A Temperature 32.3
1 1/1/2020 A Humidity 60%
2 1/2/2020 B Humidity 70%
Try the followings
iix = pd.MultiIndex.from_product([np.unique(df.Date), np.unique(df.Measurement_Type)])
df_pivot = (df.pivot_table('Value', 'Site', ['Date', 'Measurement_Type'], aggfunc='first')
.reindex(iix, axis=1))
arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
.to_numpy().tolist())
print(arr)
Out[1447]:
array([[['60%', '32.3'],
[nan, nan]],
[[nan, nan],
['70%', nan]]], dtype=object)
Method 2: using pivot_table
on different columns and numpy reshape
iix_n = pd.MultiIndex.from_product([np.unique(df.Site), np.unique(df.Date)])
arr = (df.pivot_table('Value', ['Site', 'Date'], 'Measurement_Type', aggfunc='first')
.reindex(iix_n).to_numpy()
.reshape(df.Site.nunique(),df.Date.nunique(),-1))
Out[1501]:
array([[['60%', '32.3'],
[nan, nan]],
[[nan, nan],
['70%', nan]]], dtype=object)
I think what you are looking for is something like a panel
. You could also just use a 3-dim numpy array. For example using panel:
p_dim = {}
# desired columns
cols = ['Site', 'Measurement Type']
for date in df.Date:
sub_df = df[df.Date.isin([date])].reset_index(drop=True)
p_dim[date] = sub_df[[c for c in sub_df.columns if c in cols]]
panel = pd.Panel(p_dim)
Now you can access the various data associated with a date using panel['1/1/2020']
assuming your Date
column is type str
. To see all available keys you can use panel.keys()
.
I did a small script to measure perf of the different @Andy L. approaches.
Second method seems slightly faster :
def pivot_table_3d_1(df, col1, col2, col3, value, aggfunc='first') :
iix = pd.MultiIndex.from_product([np.unique(df[col2]), np.unique(df[col3])])
df_pivot = (df.pivot_table(value, col1, [col2,col3], aggfunc=aggfunc)
.reindex(iix, axis=1))
arr = np.array(df_pivot.groupby(level=0, axis=1).agg(lambda x: [*x.values])
.to_numpy().tolist())
return arr
def pivot_table_3d_2(df, col1, col2, col3, value, aggfunc='first') :
iix_n = pd.MultiIndex.from_product([np.unique(df[col1]), np.unique(df[col2])])
arr = (df.pivot_table(value, [col1, col2], col3, aggfunc=aggfunc)
.reindex(iix_n).to_numpy()
.reshape(df[col1].nunique(),df[col2].nunique(),-1))
return arr
## TESTING
N1 = 100
N2 = 200
N3 = 300
df = pd.DataFrame({'col1': np.random.randint(0, N1, N1*N2*N3),
'col2': np.random.randint(0, N2, N1*N2*N3),
'col3': np.random.randint(0, N3, N1*N2*N3),
'value': np.random.normal(0,1,N1*N2*N3)})
%timeit pivot_table_3d(df, col1='col1', col2='col2', col3='col3', value='value')
# 10.2 s ± 39.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pivot_table_3d_2(df, col1='col1', col2='col2', col3='col3', value='value')
#9.47 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With