Conditional deduplication in polars

Question

I have a dataset i'm trying to remove duplicate entries from. The lazyframe i'm working with is structured like this:

df = pl.from_repr("""
┌──────┬────────────┬──────────────────┬───────┐
│ id   ┆ title      ┆ type             ┆ type2 │
│ ---  ┆ ---        ┆ ---              ┆ ---   │
│ i64  ┆ str        ┆ str              ┆ i64   │
╞══════╪════════════╪══════════════════╪═══════╡
│ 1001 ┆ Research A ┆ journal article  ┆ 35    │
│ 1002 ┆ Research B ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research C ┆ journal article  ┆ 35    │
│ 1004 ┆ Research D ┆ conference paper ┆ 42    │
│ 1001 ┆ Research E ┆ journal article  ┆ 35    │
│ 1002 ┆ Research F ┆ journal article  ┆ 41    │
│ 1003 ┆ Research G ┆                  ┆ 41    │
│ 1002 ┆ Research I ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research J ┆ journal article  ┆ 35    │
└──────┴────────────┴──────────────────┴───────┘
""")

I want to remove entries that have the same id, but there are actually different cases:

The duplicates have the same type (e.g. 1001): keep the first one.
The duplicates have a different type: discard the ones with an empty string ("") as type, and then keep only the entries that respect the following pairs of type and type2:

dict_df = pl.DataFrame({
    "type": ['journal article', 'book chapter', 'book chapter'],
    "type2": [35, 41, 42]
})

Expected output

id[i64]	title[str]	type[str]	type2[i64]
1001	Research A	journal article	35
1002	Research B	book chapter	41
1003	Research C	journal article	35
1004	Research D	conference paper	42

1001: same type, keep the first one
1002: different type, keep the first occurrence of the entries with pair {'book chapter': 41}
1003: different type, discard the entries with an empty type and keep the first occurrence
1004: not a duplicate

I've tried many things, mainly using the pl.when() expression, but I cannot work out a way to filter the groups.

(
    df
    .sort('type', descending=True)
    .group_by("id")
    .agg([
        pl.when(pl.col("type").n_unique() > 1)
        .then( ... )
        .otherwise(pl.all().first())
    ])
)

orlp · Accepted Answer

Create a preference table indicating how much you want each combination:

preference = pl.DataFrame({
    "type": ["journal article", "book chapter", "book chapter"],
    "iris_type": [35, 41, 42],
    "preference": [0, 1, 2]
})

Join the preference table with your data table:

joined = df.lazy().join(preference.lazy(), on=["type", "iris_type"], how="left")

Sort the joined table on the preference, pick the first one in each group and drop the preference column:

out = (
    joined.sort("preference", descending=True, nulls_last=True)
        .group_by("id")
        .first()
        .drop("preference")
        .collect()
)

Conditional deduplication in polars

Tags:

python

python-polars

lzill

1 Answers

orlp

Recent Activity

Donate For Us

Conditional deduplication in polars

Tags:

python

python-polars

lzill

1 Answers

orlp

Related questions

Recent Activity

Donate For Us