I know there are duplicate rows in a large dataframe because unique() returns a smaller dataframe.
I'd like to extract those duplicate rows to help figure out where they are coming from.
I see references to various duplicate-related functions for earlier versions, but I can't make any of them work on 0.6.
So how can I create a dataframe containing the duplicate rows of another dataframe?
DataFrames has the nonunique
function, which returns a Boolean mask that is true for each row that duplicates an earlier row (the first occurrence is marked false):
julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │
julia> nonunique(df)
10-element Array{Bool,1}:
false
false
false
true
true
true
false
false
false
false
You can convert the Boolean mask into linear indices with findall:
julia> findall(nonunique(df))
3-element Array{Int64,1}:
4
5
6
To build on @mbauman's answer: you may want to display the actual duplicated data with
df[findall(nonunique(df)), :]
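Note that nonunique marks only the second and later occurrences, so the expression above omits the first copy of each duplicated row (e.g. row 3 for (2, 13) in the example data). If you want every occurrence of a duplicated row, including the first, one option is to group on all columns and keep the groups with more than one row. This is a sketch against a recent DataFrames.jl API (groupby, nrow); the 0.6-era API differs:

```julia
using DataFrames

# Same data as the example above, hard-coded so the result is reproducible.
df = DataFrame(X = [2, 1, 2, 2, 2, 1, 2, 3, 2, 1],
               Y = [11, 10, 13, 13, 13, 10, 10, 13, 12, 11])

# Group by every column; any group with more than one row is a set of duplicates.
gd = groupby(df, names(df))

# Collect the duplicated groups back into one DataFrame.
# (reduce over an empty collection would error, so guard with isempty if
# the dataframe might contain no duplicates at all.)
dup_groups = [DataFrame(g) for g in gd if nrow(g) > 1]
dups = isempty(dup_groups) ? similar(df, 0) : reduce(vcat, dup_groups)
```

With the example data, dups contains both copies of (1, 10) and all three copies of (2, 13), which makes it easier to trace where the duplicates originate.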