Joining polars dataframes while ignoring duplicate values in the on column

Question

Given the code

df1 = pl.DataFrame({"A": [1, 1], "B": [3, 4]})
df2 = pl.DataFrame({"A": [1, 1], "C": [5, 6]})
result = df1.join(df2, on='A')

result looks like

shape: (4, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 5   │
│ 1   ┆ 3   ┆ 6   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

but I would like it to be

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

Experimenting with left_on, right_on and how parameters did not resolve this issue.

Hericks · Accepted Answer

If the tables are aligned (possibly, after sorting by the on column), you can concatenate them horizontally using pl.concat.

pl.concat([df1.sort("A"), df2.sort("A").drop("A")], how="horizontal")

shape: (2, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
│ 1   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

If one table contains more rows and you want to keep the extra rows, you could add an extra column to store the index of the row among all rows with same on column(s). Then, you can join the tables using both on column(s) and the index.

(
    df1
    .with_columns(pl.int_range(pl.len()).over("A").alias("id"))
    .join(
        df2.with_columns(pl.int_range(pl.len()).over("A").alias("id")),
        on=["A", "id"],
        how="left",
    )
    .select(pl.exclude("id"))
)

Joining polars dataframes while ignoring duplicate values in the on column

Tags:

python

dataframe

python-polars

vinnewbie

1 Answers

Hericks

Recent Activity

Donate For Us

Joining polars dataframes while ignoring duplicate values in the on column

Tags:

python

dataframe

python-polars

vinnewbie

1 Answers

Hericks

Related questions

Recent Activity

Donate For Us