Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using either melt or wide_to_long for large dataset with inconsistent naming

I have a large dataset that is difficult to investigate without analysis tools. It's general form is this, but with 16 "ItemType0" columns and 16 "ItemType1", "ItemType2", etc columns.

It represents the properties (many of them) of up to 16 different items recorded at a single timestep, then properties of that timestep.

Time ItemType0[0].property ItemType0[1].property Property
1 1 0 2
2 0 1 2
3 3 3 2

I'd like to receive:

Time ItemType0.property Property
1 1 2
2 0 2
3 3 2
1 0 2
2 1 2
3 3 2
import pandas as pd

wide_df = pd.DataFrame({
    "Time": [1,2,3],
    "ItemType0[0].property": [1,0,3],
    "ItemType0[1].property": [0,1,3],
    "Property": [2,2,2]})

What I've tried:

  1. Melt:

    ids = [col for col in wide_df.columns if "[" not in col]
    inter_df = pd.melt(wide_df, id_vars=ids, var_name="Source")
    

    MemoryError: Unable to allocate 28.3 GiB for an array with shape (15,506831712) and data type uint32

  2. I wouldn't even know where to begin with pd.wide_to_long as everything doesn't start with the same.

like image 512
KerbFusion Avatar asked Jan 19 '26 12:01

KerbFusion


1 Answers

IIUC, you can try to group the properties by ItemTypeX and then explode:

df.columns = df.columns.str.replace(r"\[\d+\]", "", regex=True)

df = df.set_index(["Time", "Property"])
df = df.T.groupby(df.columns).agg(list).T

print(df.reset_index().explode(df.columns.to_list()))

Prints:

   Time  Property ItemType0.property
0     1         2                  1
0     1         2                  0
1     2         2                  0
1     2         2                  1
2     3         2                  3
2     3         2                  3
like image 68
Andrej Kesely Avatar answered Jan 22 '26 00:01

Andrej Kesely



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!