Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Turn polars dataframe into nested dictionary

I have a polars dataframe like such:

print(
    pl.DataFrame(
        {
            "file": ["A", "A", "A", "B", "B", "B"],
            "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
            "data1": [1, 2, 3, 4, 5, 6],
            "data2": [7, 8, 9, 10, 11, 12],
            "data3": [13, 14, 15, 16, 17, 18],
        }
    )
)



shape: (6, 5)
┌──────┬──────┬───────┬───────┬───────┐
│ file ┆ user ┆ data1 ┆ data2 ┆ data3 │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   │
│ str  ┆ str  ┆ i64   ┆ i64   ┆ i64   │
╞══════╪══════╪═══════╪═══════╪═══════╡
│ A    ┆ u1   ┆ 1     ┆ 7     ┆ 13    │
│ A    ┆ u2   ┆ 2     ┆ 8     ┆ 14    │
│ A    ┆ u3   ┆ 3     ┆ 9     ┆ 15    │
│ B    ┆ u1   ┆ 4     ┆ 10    ┆ 16    │
│ B    ┆ u2   ┆ 5     ┆ 11    ┆ 17    │
│ B    ┆ u3   ┆ 6     ┆ 12    ┆ 18    │
└──────┴──────┴───────┴───────┴───────┘

And I would like to turn it into a bested dictionary like this

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

I tried my luck with grouping and selecting structs, but I'm unable to do the conversion using only polars functions. What's the right approach here?

EDIT:

I tried two things:

  1. Using rows_by_key:
df.rows_by_key(key=["file", "user"], unique=True, named=True)

but that gives the keys as tuples, i.e.,

{('A', 'u1'): {'data1': 1, 'data2': 7, 'data3': 13},
 ('A', 'u2'): {'data1': 2, 'data2': 8, 'data3': 14},
 ('A', 'u3'): {'data1': 3, 'data2': 9, 'data3': 15},
 ('B', 'u1'): {'data1': 4, 'data2': 10, 'data3': 16},
 ('B', 'u2'): {'data1': 5, 'data2': 11, 'data3': 17},
 ('B', 'u3'): {'data1': 6, 'data2': 12, 'data3': 18}}
  1. I also tried to manually loop over the groups:
d = {}
for file, df_file in (df).group_by("file", maintain_order=True):
    d_user = {}
    for user, df_file_user in df_file.group_by("user", maintain_order=True):
        d_user[user[0]] = df_file_user.drop("file", "user").to_dicts()[0]
    d[file[0]] = {"user": d_user}
print({"file": d})

This yields the right output:

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

But this feels cumbersome and I wonder whether there's a more direct way of doing this.

like image 286
Thomas Avatar asked Apr 23 '26 21:04

Thomas


1 Answers

Starting with what you already had:

import polars as pl

df = pl.DataFrame(
    {
        "file": ["A", "A", "A", "B", "B", "B"],
        "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
        "data1": [1, 2, 3, 4, 5, 6],
        "data2": [7, 8, 9, 10, 11, 12],
        "data3": [13, 14, 15, 16, 17, 18],
    }
)

rows = df.rows_by_key(key=["file", "user"], unique=True, named=True)

result = {
    "file": {
        file: {
            user: values
            for (f, user), values in rows.items()
            if f == file
        }
        for file in dict.fromkeys(f for f, _ in rows)
    }
}

This avoids the nested loop you have, but still loops twice in the comprehension, after having constructed rows like you did.

It would perhaps be preferable to do it all in one go like this instead:

result = {
    "file": {
        file: {
            row["user"]: {k: row[k] for k in row if k not in ("file", "user")}
            for row in group.to_dicts()
        }
        for (file,), group in df.partition_by(["file"], as_dict=True).items()
    }
}

The result (for both):

from pprint import pprint
pprint(result)

Output:

{'file': {'A': {'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}},
          'B': {'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}}}}

Perhaps a more appropriate question would be: why do you need it in the format you proposed? There may be a better way to achieve your ultimate goal from the original data, if you have no control over how the original data is structured in the first place.

like image 55
Grismar Avatar answered Apr 26 '26 12:04

Grismar



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!