julia: find duplicate rows in dataframes

Question

I know there are duplicate rows in a large dataframe because unique() results in a smaller dataframe.

I'd like to get those duplicates to help figure out where they are coming from.

I see references to various duplicate-related functions for earlier versions, but I can't make any of them work on Julia 0.6.

So how can I create a dataframe that contains the duplicate rows of another dataframe?

mbauman · Accepted Answer

DataFrames has the nonunique function, which returns a logical mask that is true wherever a row duplicates an earlier row (the first occurrence of each row is marked false):

julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │

julia> nonunique(df)
10-element Array{Bool,1}:
 false
 false
 false
  true
  true
  true
 false
 false
 false
 false

You can convert the logical mask into linear indices with findall (on Julia 0.6, where findall does not exist yet, use find instead):

julia> findall(nonunique(df))
3-element Array{Int64,1}:
 4
 5
 6

Jovansam · Answer

To build on @mbauman's answer, you may want to display the actual duplicate rows with:

 df[findall(nonunique(df)), :]
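
Since nonunique leaves the first occurrence of each row unflagged, it can also help to see how often each repeated combination occurs. The following is only a sketch, assuming a recent DataFrames.jl (1.x) on Julia 1.x rather than the 0.6 setup from the question; the small df and the column name count are made up for illustration.

```julia
using DataFrames

# Illustrative data: (2, 13) appears three times and (1, 10) twice.
df = DataFrame(X = [2, 1, 2, 2, 2, 1], Y = [11, 10, 13, 13, 13, 10])

# Rows that repeat an earlier row; the mask can index the rows directly.
dupes = df[nonunique(df), :]

# Count the occurrences of every distinct row by grouping on all columns.
# `count` is a hypothetical column name chosen for this example.
counts = combine(groupby(df, names(df)), nrow => :count)

# Keep only the combinations that occur more than once.
repeated = filter(:count => >(1), counts)
```

Here repeated has one row per duplicated combination, so its count column tells you how many copies of each exist, which may make it easier to trace where the duplicates come from.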
