Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Joining polars dataframes while ignoring duplicate values in the on column

Given the code

df1 = pl.DataFrame({"A": [1, 1], "B": [3, 4]})
df2 = pl.DataFrame({"A": [1, 1], "C": [5, 6]})
result = df1.join(df2, on='A')

result looks like

shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 5   │
│ 1   ┆ 3   ┆ 6   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

but I would like it to be

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

Experimenting with left_on, right_on and how parameters did not resolve this issue.

like image 735
vinnewbie Avatar asked Nov 19 '25 15:11

vinnewbie


1 Answers

If the tables are aligned (possibly, after sorting by the on column), you can concatenate them horizontally using pl.concat.

pl.concat([df1.sort("A"), df2.sort("A").drop("A")], how="horizontal")
shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

If one table contains more rows and you want to keep the extra rows, you could add an extra column to store the index of the row among all rows with same on column(s). Then, you can join the tables using both on column(s) and the index.

(
    df1
    .with_columns(pl.int_range(pl.len()).over("A").alias("id"))
    .join(
        df2.with_columns(pl.int_range(pl.len()).over("A").alias("id")),
        on=["A", "id"],
        how="left",
    )
    .select(pl.exclude("id"))
)
like image 195
Hericks Avatar answered Nov 21 '25 03:11

Hericks