My dataframe contains a multivariate time series per user id. The first column, id, is the user id (there are N users), the second, dt, is the date (each user has T days' worth of data, i.e., T rows per user), and the other columns are metrics (basically, each column is a time series per id). Here's code to recreate a similar dataframe:
import pandas as pd
from datetime import datetime
import numpy as np
N=5
T=100
dfs=[]
datelist = pd.date_range(datetime.today(), periods=T).tolist()
for id in range(N):
    # One block of T rows and 4 metric columns per user
    test = pd.DataFrame(np.random.randint(0,100,size=(T, 4)), columns=list('ABCD'))
    test['dt'] = datelist
    test['id']=id
    dfs.append(test)
dfs = pd.concat(dfs)
The output would look something like this, where 'A','B' and so on are metrics (like total purchases):
| | A | B | C | D | dt | id |
|---|---|---|---|---|---|---|
| 0 | 58 | 3 | 52 | 5 | 2023-07-05 14:34:12.852460 | 0 |
| 1 | 6 | 34 | 28 | 88 | 2023-07-06 14:34:12.852460 | 0 |
| 2 | 27 | 98 | 74 | 81 | 2023-07-07 14:34:12.852460 | 0 |
| 3 | 96 | 13 | 7 | 52 | 2023-07-08 14:34:12.852460 | 0 |
| 4 | 80 | 69 | 22 | 12 | 2023-07-09 14:34:12.852460 | 0 |
I want to transform this data into a numpy matrix X, of shape N x T x F, where N is the number of users, T is the number of days in the time series (T is constant for all ids) and F is the number of metrics (in the example above, F=4.)
This means that X[0] returns a TxF array, that should look exactly like the output of dfs.query('id==0')[['A','B','C','D']].values
So far, I've tried using pivot and reshape but the elements in the final matrix are not arranged as I would like. Here's what I've tried:
# Pivot the dataframe
df_pivot = dfs.sort_values(['id','dt']).pivot(index='id', columns='dt')
# Get the values from the pivot table
X = df_pivot.values.reshape(dfs['id'].nunique(), -1, len([x for x in dfs.columns if x not in ['dt','id']]))
If I do X[0], the result I get is:
[58, 6, 27, 96],
[80, 65, 41, 39],
[30, 26, 38, 13],
[50, 60, 60, 73],
...
As you can see, this is not the result I want. This is what I need:
[58, 3, 52, 5],
[ 6, 34, 28, 88],
[27, 98, 74, 81],
[96, 13, 7, 52],
...
Any help appreciated!
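Note that a NumPy array must be rectangular, so this only works if every id has the same number of rows.
Assuming this is the case, and that the rows are already grouped by id and ordered by dt as in the example (df below is the question's dfs), you could do: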
r = sum(df.id == 0)
c = df.shape[1] - 2
arr = df.drop(columns = ['dt', 'id']).values.reshape(-1, r, c)
arr[0]
array([[58, 3, 52, 5],
[ 6, 34, 28, 88],
[27, 98, 74, 81],
[96, 13, 7, 52],
[80, 69, 22, 12]])
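The reshape above relies on the rows already being grouped by id and ordered by dt. As a minimal sketch of a version that does not depend on row order, the code below sorts first and then reshapes. It also shows one possible way to repair the pivot attempt from the question: after the pivot the columns are laid out metric by metric, so the values are reshaped to N x F x T and the last two axes are swapped. The names `metrics`, `shuffled`, `ordered` and `X_pivot`, the fixed start date and the seeded random generator are illustrative assumptions, not part of the original posts.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the question's `dfs` (N users, T days, 4 metrics).
# The start date and the seed are arbitrary choices for reproducibility.
N, T = 5, 100
metrics = list("ABCD")
dates = pd.date_range("2023-07-05", periods=T)
rng = np.random.default_rng(0)
frames = []
for user_id in range(N):
    frame = pd.DataFrame(rng.integers(0, 100, size=(T, len(metrics))), columns=metrics)
    frame["dt"] = dates
    frame["id"] = user_id
    frames.append(frame)
dfs = pd.concat(frames, ignore_index=True)

# Shuffle the rows to show that the result does not depend on the incoming order.
shuffled = dfs.sample(frac=1, random_state=0)

# Sort so each user's rows are contiguous and chronological, then reshape.
ordered = shuffled.sort_values(["id", "dt"])
X = ordered[metrics].to_numpy().reshape(N, T, len(metrics))

# Alternative: keep the pivot. Its columns are ordered (metric, date),
# so reshape to (N, F, T) and swap the last two axes to get (N, T, F).
pivoted = shuffled.pivot(index="id", columns="dt", values=metrics)
X_pivot = pivoted.to_numpy().reshape(N, len(metrics), T).transpose(0, 2, 1)

# X[0] should match the rows of user 0 in their original chronological order.
assert np.array_equal(X[0], dfs.loc[dfs["id"] == 0, metrics].to_numpy())
assert np.array_equal(X, X_pivot)
```

Both variants still assume that every id has exactly T rows; if some users have missing days, the reshape raises an error (or the pivot introduces NaN values), and the data would need to be padded or filtered first.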