I have a dataframe that I'd like to explore and look only the duplicated rows based on two or more columns.
For example:
df = pl.DataFrame({"A": [1, 6, 5, 4, 5, 6],
"B": ["A", "B", "C", "D", "C", "A"],
"C": [2, 2, 2, 1, 1, 1]})
I'd like to return duplicate combination for columns A and B only. I've tried:
df.filter(pl.col("A", "B").is_duplicated()) # Returns: This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.
When adding .all() in between, the result is the same as above
df.filter(pl.col("A", "B").all().is_duplicated()) # Same as above
Unique with keep "none" returns the opposite result I'd like to have, so tried the below:
df.unique(subset=["A", "B"], keep="none").is_not() # 'DataFrame' object has no attribute 'is_not'
Expected output would be to see only the rows:
shape: (2, 3)
┌─────┬─────┬─────┐
│ A | B | C │
│ --- | --- | --- │
│ i64 | str | i64 │
╞═════╪═════╪═════╡
│ 5 | C | 2 │
│ 5 | C | 1 │
└─────┴─────┴─────┘
A struct allows you to "combine" multiple expressions together as a "single entity" (or "key").
Which you can use .is_duplicated() on.
df.filter(pl.struct("A", "B").is_duplicated())
shape: (2, 3)
┌─────┬─────┬─────┐
│ A | B | C │
│ --- | --- | --- │
│ i64 | str | i64 │
╞═════╪═════╪═════╡
│ 5 | C | 2 │
│ 5 | C | 1 │
└─────┴─────┴─────┘
We can use LazyFrame.explain() to inspect the query plan.
print(
df.lazy().select(pl.col("A", "B").is_duplicated()).explain()
)
SELECT [col("A").is_duplicated(), col("B").is_duplicated()] FROM
DF ["A", "B", "C"]; PROJECT 2/3 COLUMNS; SELECTION: None
pl.col("A", "B").is_duplicated() expands in to 2 separate calls.
pl.col("A").is_duplicated()
pl.col("B").is_duplicated()
When you create a struct, .is_duplicated() is only called once.
df.select(pl.struct("A", "B"))
shape: (6, 1)
┌───────────┐
│ A │
│ --- │
│ struct[2] │
╞═══════════╡
│ {1,"A"} │
│ {6,"B"} │
│ {5,"C"} │
│ {4,"D"} │
│ {5,"C"} │
│ {6,"A"} │
└───────────┘
In the .is_duplicated() documentation we have the following example:
Examples
df = pl.DataFrame(
{
"a": [1, 2, 3, 1],
"b": ["x", "y", "z", "x"],
}
)
df.is_duplicated()
shape: (4,)
Series: '' [bool]
[
true
false
false
true
]
This mask can be used to visualize the duplicated lines like this:
df.filter(df.is_duplicated())
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ x │
│ 1 ┆ x │
└─────┴─────┘
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With