Turn polars dataframe into nested dictionary

Question

I have a polars dataframe like such:

print(
    pl.DataFrame(
        {
            "file": ["A", "A", "A", "B", "B", "B"],
            "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
            "data1": [1, 2, 3, 4, 5, 6],
            "data2": [7, 8, 9, 10, 11, 12],
            "data3": [13, 14, 15, 16, 17, 18],
        }
    )
)



shape: (6, 5)
┌──────┬──────┬───────┬───────┬───────┐
│ file ┆ user ┆ data1 ┆ data2 ┆ data3 │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   │
│ str  ┆ str  ┆ i64   ┆ i64   ┆ i64   │
╞══════╪══════╪═══════╪═══════╪═══════╡
│ A    ┆ u1   ┆ 1     ┆ 7     ┆ 13    │
│ A    ┆ u2   ┆ 2     ┆ 8     ┆ 14    │
│ A    ┆ u3   ┆ 3     ┆ 9     ┆ 15    │
│ B    ┆ u1   ┆ 4     ┆ 10    ┆ 16    │
│ B    ┆ u2   ┆ 5     ┆ 11    ┆ 17    │
│ B    ┆ u3   ┆ 6     ┆ 12    ┆ 18    │
└──────┴──────┴───────┴───────┴───────┘

And I would like to turn it into a bested dictionary like this

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

I tried my luck with grouping and selecting structs, but I'm unable to do the conversion using only polars functions. What's the right approach here?

EDIT:

I tried two things:

Using rows_by_key:

df.rows_by_key(key=["file", "user"], unique=True, named=True)

but that gives the keys as tuples, i.e.,

{('A', 'u1'): {'data1': 1, 'data2': 7, 'data3': 13},
 ('A', 'u2'): {'data1': 2, 'data2': 8, 'data3': 14},
 ('A', 'u3'): {'data1': 3, 'data2': 9, 'data3': 15},
 ('B', 'u1'): {'data1': 4, 'data2': 10, 'data3': 16},
 ('B', 'u2'): {'data1': 5, 'data2': 11, 'data3': 17},
 ('B', 'u3'): {'data1': 6, 'data2': 12, 'data3': 18}}

I also tried to manually loop over the groups:

d = {}
for file, df_file in (df).group_by("file", maintain_order=True):
    d_user = {}
    for user, df_file_user in df_file.group_by("user", maintain_order=True):
        d_user[user[0]] = df_file_user.drop("file", "user").to_dicts()[0]
    d[file[0]] = {"user": d_user}
print({"file": d})

This yields the right output:

{
    'file': {
        'A': {
            'user': {
                'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}
            }
        },
        'B': {
            'user': {
                'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}
            }
        }
    }
}

But this feels cumbersome and I wonder whether there's a more direct way of doing this.

Grismar · Accepted Answer

Starting with what you already had:

import polars as pl

df = pl.DataFrame(
    {
        "file": ["A", "A", "A", "B", "B", "B"],
        "user": ["u1", "u2", "u3", "u1", "u2", "u3"],
        "data1": [1, 2, 3, 4, 5, 6],
        "data2": [7, 8, 9, 10, 11, 12],
        "data3": [13, 14, 15, 16, 17, 18],
    }
)

rows = df.rows_by_key(key=["file", "user"], unique=True, named=True)

result = {
    "file": {
        file: {
            user: values
            for (f, user), values in rows.items()
            if f == file
        }
        for file in dict.fromkeys(f for f, _ in rows)
    }
}

This avoids the nested loop you have, but still loops twice in the comprehension, after having constructed rows like you did.

It would perhaps be preferable to do it all in one go like this instead:

result = {
    "file": {
        file: {
            row["user"]: {k: row[k] for k in row if k not in ("file", "user")}
            for row in group.to_dicts()
        }
        for (file,), group in df.partition_by(["file"], as_dict=True).items()
    }
}

The result (for both):

from pprint import pprint
pprint(result)

Output:

{'file': {'A': {'u1': {'data1': 1, 'data2': 7, 'data3': 13},
                'u2': {'data1': 2, 'data2': 8, 'data3': 14},
                'u3': {'data1': 3, 'data2': 9, 'data3': 15}},
          'B': {'u1': {'data1': 4, 'data2': 10, 'data3': 16},
                'u2': {'data1': 5, 'data2': 11, 'data3': 17},
                'u3': {'data1': 6, 'data2': 12, 'data3': 18}}}}

Perhaps a more appropriate question would be: why do you need it in the format you proposed? There may be a better way to achieve your ultimate goal from the original data, if you have no control over how the original data is structured in the first place.

Turn polars dataframe into nested dictionary

Tags:

python

python-polars

Thomas

1 Answers

Grismar

Recent Activity

Donate For Us

Turn polars dataframe into nested dictionary

Tags:

python

python-polars

Thomas

1 Answers

Grismar

Related questions

Recent Activity

Donate For Us