Given the code
df1 = pl.DataFrame({"A": [1, 1], "B": [3, 4]})
df2 = pl.DataFrame({"A": [1, 1], "C": [5, 6]})
result = df1.join(df2, on='A')
result looks like
shape: (4, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
│ 1 ┆ 4 ┆ 5 │
│ 1 ┆ 3 ┆ 6 │
│ 1 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
but I would like it to be
shape: (2, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
│ 1 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
Experimenting with left_on, right_on and how parameters did not resolve this issue.
If the tables are aligned (possibly, after sorting by the on column), you can concatenate them horizontally using pl.concat.
pl.concat([df1.sort("A"), df2.sort("A").drop("A")], how="horizontal")
shape: (2, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
│ 1 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
If one table contains more rows and you want to keep the extra rows, you could add an extra column to store the index of the row among all rows with same on column(s). Then, you can join the tables using both on column(s) and the index.
(
df1
.with_columns(pl.int_range(pl.len()).over("A").alias("id"))
.join(
df2.with_columns(pl.int_range(pl.len()).over("A").alias("id")),
on=["A", "id"],
how="left",
)
.select(pl.exclude("id"))
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With