How can I reshape my dataframe into a 3-dimensional numpy array?

Tags:

python

pandas

dataframe

numpy

Steven Cunden

KU99

Question

My dataframe contains a multivariate time series per user id. The first column, id, is the user id (there are N users); the second, dt, is the date (each user has T days' worth of data, i.e. T rows per user); and the remaining columns are metrics (each column is one time series per id). Here's some code to recreate a similar dataframe:

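import pandas as pd
from datetime import datetime
import numpy as np

N = 5    # number of users
T = 100  # days of data per user

dfs = []
datelist = pd.date_range(datetime.today(), periods=T).tolist()

for user_id in range(N):
    # T rows of four random integer metrics (T rather than a hard-coded 100)
    test = pd.DataFrame(np.random.randint(0, 100, size=(T, 4)), columns=list('ABCD'))
    test['dt'] = datelist
    test['id'] = user_id
    dfs.append(test)

# Stack the per-user frames into one long dataframe
dfs = pd.concat(dfs)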

The output would look something like this, where 'A','B' and so on are metrics (like total purchases):

I want to transform this data into a NumPy array X of shape N x T x F, where N is the number of users, T is the number of days in the time series (T is constant for all ids) and F is the number of metrics (in the example above, F=4).

This means that X[0] returns a T x F array, which should look exactly like the output of dfs.query('id==0')[['A','B','C','D']].values

So far, I've tried using pivot and reshape but the elements in the final matrix are not arranged as I would like. Here's what I've tried:

# Pivot the dataframe
df_pivot = dfs.sort_values(['id','dt']).pivot(index='id', columns='dt')

# Get the values from the pivot table
X = df_pivot.values.reshape(dfs['id'].nunique(), -1, len([x for x in dfs.columns if x not in ['dt','id']]))

If I do X[0], the result I get is:

[58,  6, 27, 96],
[80, 65, 41, 39],
[30, 26, 38, 13],
[50, 60, 60, 73],
...

As you can see, this is not the result I want: each row holds consecutive values of a single metric rather than one value per metric. This is what I need:

[58,  3, 52,  5],
[ 6, 34, 28, 88],
[27, 98, 74, 81],
[96, 13,  7, 52],
...

Any help appreciated!
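The mismatch shown above can be traced to how the pivoted frame is laid out. The following is a minimal sketch, assuming a small made-up frame with two metrics and a fixed start date (neither is from the original post): after pivot, each row lists one metric across all dates, then the next metric, so the flat row has to be read as (F, T) and transposed to (T, F).

```python
import numpy as np
import pandas as pd

# Small synthetic frame in the same layout as the question (values are made up).
N, T, F = 3, 4, 2
rng = np.random.default_rng(0)
dates = pd.date_range("2023-07-05", periods=T)
dfs = pd.concat(
    [
        pd.DataFrame(rng.integers(0, 100, size=(T, F)), columns=["A", "B"]).assign(dt=dates, id=i)
        for i in range(N)
    ],
    ignore_index=True,
)

df_pivot = dfs.pivot(index="id", columns="dt")

# Each pivot row is metric-major: A at every date, then B at every date.
# Reshaping straight to (N, T, F) therefore mixes dates into the metric axis.
X_scrambled = df_pivot.to_numpy().reshape(N, T, F)

# Reading the row as (F, T) and swapping the last two axes gives (N, T, F).
X = df_pivot.to_numpy().reshape(N, F, T).transpose(0, 2, 1)

expected = dfs.query("id == 0")[["A", "B"]].to_numpy()
print(np.array_equal(X[0], expected))
```

This keeps the pivot-based attempt and only changes the reshape, so it is an alternative to dropping the pivot altogether.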

KU99 · Accepted Answer

Note that a NumPy array can only hold sub-arrays of the same shape, so every id must have the same number of rows.

Assuming this is the case, you could do:

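# df is the concatenated dataframe (dfs in the question), with rows grouped by id
r = sum(df.id == 0)      # rows per id (T)
c = df.shape[1] - 2      # metric columns, i.e. everything except dt and id (F)
arr = df.drop(columns=['dt', 'id']).values.reshape(-1, r, c)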

arr[0]
array([[58,  3, 52,  5],
       [ 6, 34, 28, 88],
       [27, 98, 74, 81],
       [96, 13,  7, 52],
       [80, 69, 22, 12]])
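The reshape above relies on the rows already being grouped by id and ordered by date. As a hedged sketch of a more defensive variant, the code below sorts first and checks the equal-length assumption before reshaping. The fixed start date, the seed, the deliberate shuffle and the names `metrics`, `ordered` and `counts` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

N, T = 5, 100
metrics = list("ABCD")
rng = np.random.default_rng(0)
dates = pd.date_range("2023-07-05", periods=T)

frames = []
for user_id in range(N):
    frame = pd.DataFrame(rng.integers(0, 100, size=(T, len(metrics))), columns=metrics)
    frame["dt"] = dates
    frame["id"] = user_id
    frames.append(frame)

# Shuffle the rows on purpose to show that the sort below restores the order.
dfs = pd.concat(frames, ignore_index=True).sample(frac=1, random_state=0)

# Group rows by id and order them by date within each id.
ordered = dfs.sort_values(["id", "dt"])

# The reshape is only valid if every id has exactly T rows.
counts = ordered.groupby("id").size()
assert (counts == T).all(), "every id must have the same number of rows"

X = ordered[metrics].to_numpy().reshape(counts.size, T, len(metrics))
print(X.shape)  # (5, 100, 4)
```

Selecting the metric columns by name, rather than dropping dt and id, also keeps the feature axis in a known order if the frame later gains extra columns.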
