
julia: find duplicate rows in dataframes

Tags: julia

I know there are duplicate rows in a large dataframe because unique() results in a smaller dataframe.

I'd like to get those duplicates to help figure out where they are coming from.

I see references to various duplicate-related functions in earlier versions, but I can't make any of them work in Julia 0.6.

So how can I create a dataframe that contains the duplicate rows contained in another dataframe?

Chuck Carlson asked Jul 10 '17 20:07


2 Answers

DataFrames has the nonunique function, which returns a logical mask that is true for every row that repeats an earlier row (the first occurrence of each row is marked false):

julia> df = DataFrame(X=rand(1:3, 10), Y=rand(10:13,10))
10×2 DataFrames.DataFrame
│ Row │ X │ Y  │
├─────┼───┼────┤
│ 1   │ 2 │ 11 │
│ 2   │ 1 │ 10 │
│ 3   │ 2 │ 13 │
│ 4   │ 2 │ 13 │
│ 5   │ 2 │ 13 │
│ 6   │ 1 │ 10 │
│ 7   │ 2 │ 10 │
│ 8   │ 3 │ 13 │
│ 9   │ 2 │ 12 │
│ 10  │ 1 │ 11 │

julia> nonunique(df)
10-element Array{Bool,1}:
 false
 false
 false
  true
  true
  true
 false
 false
 false
 false

You can convert the logical mask into linear indices with findall (on Julia 0.6 the equivalent function is find):

julia> findall(nonunique(df))
3-element Array{Int64,1}:
 4
 5
 6
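
Because nonunique leaves the first occurrence of each repeated row unmarked, the mask alone does not show every copy. As a minimal sketch, assuming a recent DataFrames.jl (1.x) on Julia 1.2 or later rather than 0.6, grouping on all columns and counting can recover every copy. The data reuses the X and Y values printed above; the names counts, repeated, all_copies and the column n are only illustrative.

```julia
using DataFrames

# Same values as the example table above, written out so the result is reproducible.
df = DataFrame(X = [2, 1, 2, 2, 2, 1, 2, 3, 2, 1],
               Y = [11, 10, 13, 13, 13, 10, 10, 13, 12, 11])

# Count how many times each distinct (X, Y) combination occurs.
counts = combine(groupby(df, [:X, :Y]), nrow => :n)

# Keep only the combinations that occur more than once.
repeated = filter(:n => >(1), counts)

# Every row of df that matches a repeated combination, first occurrences included.
all_copies = semijoin(df, repeated, on = [:X, :Y])
```

Here repeated holds one row per duplicated combination with its count, which may help when tracing where the duplicates come from.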
mbauman answered Sep 19 '22 01:09


To build on @mbauman's answer, you may want to display the actual duplicate rows with:

 df[findall(nonunique(df)), :]
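
As a small sketch under the same assumption of a recent DataFrames.jl, the Boolean mask can also index the rows directly, without findall, and nonunique accepts a column selection when only some columns should be compared. The variable names here are illustrative.

```julia
using DataFrames

# Same values as the example table in the first answer.
df = DataFrame(X = [2, 1, 2, 2, 2, 1, 2, 3, 2, 1],
               Y = [11, 10, 13, 13, 13, 10, 10, 13, 12, 11])

mask = nonunique(df)             # true for rows that repeat an earlier row
dups = df[mask, :]               # a Boolean mask selects rows directly
same = df[findall(mask), :]      # the same selection via integer indices

# Compare rows on column X only, ignoring Y.
dups_by_x = df[nonunique(df, [:X]), :]
```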
Jovansam answered Sep 18 '22 01:09