Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional deduplication in polars

I have a dataset i'm trying to remove duplicate entries from. The lazyframe i'm working with is structured like this:

df = pl.from_repr("""
┌──────┬────────────┬──────────────────┬───────┐
│ id   ┆ title      ┆ type             ┆ type2 │
│ ---  ┆ ---        ┆ ---              ┆ ---   │
│ i64  ┆ str        ┆ str              ┆ i64   │
╞══════╪════════════╪══════════════════╪═══════╡
│ 1001 ┆ Research A ┆ journal article  ┆ 35    │
│ 1002 ┆ Research B ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research C ┆ journal article  ┆ 35    │
│ 1004 ┆ Research D ┆ conference paper ┆ 42    │
│ 1001 ┆ Research E ┆ journal article  ┆ 35    │
│ 1002 ┆ Research F ┆ journal article  ┆ 41    │
│ 1003 ┆ Research G ┆                  ┆ 41    │
│ 1002 ┆ Research I ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research J ┆ journal article  ┆ 35    │
└──────┴────────────┴──────────────────┴───────┘
""")

I want to remove entries that have the same id, but there are actually different cases:

  1. The duplicates have the same type (e.g. 1001): keep the first one.
  2. The duplicates have a different type: discard the ones with an empty string ("") as type, and then keep only the entries that respect the following pairs of type and type2:
dict_df = pl.DataFrame({
    "type": ['journal article', 'book chapter', 'book chapter'],
    "type2": [35, 41, 42]
})

Expected output

id[i64] title[str] type[str] type2[i64]
1001 Research A journal article 35
1002 Research B book chapter 41
1003 Research C journal article 35
1004 Research D conference paper 42
  • 1001: same type, keep the first one
  • 1002: different type, keep the first occurrence of the entries with pair {'book chapter': 41}
  • 1003: different type, discard the entries with an empty type and keep the first occurrence
  • 1004: not a duplicate

I've tried many things, mainly using the pl.when() expression, but I cannot work out a way to filter the groups.

(
    df
    .sort('type', descending=True)
    .group_by("id")
    .agg([
        pl.when(pl.col("type").n_unique() > 1)
        .then( ... )
        .otherwise(pl.all().first())
    ])
)

like image 1000
lzill Avatar asked Sep 01 '25 20:09

lzill


1 Answers

  1. Create a preference table indicating how much you want each combination:
preference = pl.DataFrame({
    "type": ["journal article", "book chapter", "book chapter"],
    "iris_type": [35, 41, 42],
    "preference": [0, 1, 2]
})
  1. Join the preference table with your data table:
joined = df.lazy().join(preference.lazy(), on=["type", "iris_type"], how="left")
  1. Sort the joined table on the preference, pick the first one in each group and drop the preference column:
out = (
    joined.sort("preference", descending=True, nulls_last=True)
        .group_by("id")
        .first()
        .drop("preference")
        .collect()
)
like image 148
orlp Avatar answered Sep 03 '25 19:09

orlp