R - find and list duplicate rows based on two columns

Tags:

r

Using R. Base package, dplyr, or data.table are all okay for me to use. My data is ~1000 rows x 20 columns. I expect about 300 duplicates.

I'd like to do something like the following, but with one alteration:

Match/group duplicate rows (indices)

I'd like to find, not fully duplicated rows, but rows duplicated in two columns. For example, given this input table:

File     T.N     ID     Col1     Col2
BAI.txt   T      1       sdaf    eiri
BAJ.txt   N      2       fdd     fds
BBK.txt   T      1       ter     ase
BCD.txt   N      1       twe     ase

If I want to find duplicates in T.N & ID only, I'd end up with the following table:

File     T.N     ID     Col1     Col2
BAI.txt   T      1       sdaf    eiri
BBK.txt   T      1       ter     ase
asked Mar 11 '16 by Gaius Augustus


3 Answers

I have found this to be an easy and useful method.

library(dplyr)
library(tibble)

tr <- tribble(~File,      ~TN, ~ID, ~Col1,  ~Col2,
              'BAI.txt',  'T',   1, 'sdaf', 'eiri',
              'BAJ.txt',  'N',   2, 'fdd',  'fds',
              'BBK.txt',  'T',   1, 'ter',  'ase',
              'BCD.txt',  'N',   1, 'twe',  'ase')

group_by(tr, TN, ID) %>%
  filter(n() > 1)

Output:

# A tibble: 2 x 5
# Groups:   TN, ID [1]
  File    TN       ID Col1  Col2 
  <chr>   <chr> <dbl> <chr> <chr>
1 BAI.txt T         1 sdaf  eiri 
2 BBK.txt T         1 ter   ase  
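If you'd rather not carry the grouping into the result, an equivalent sketch (not from the original answer) uses add_count, which appends a group-size column without leaving the tibble grouped; the column name n is dplyr's default:

```r
library(dplyr)
library(tibble)

tr <- tribble(~File,      ~TN, ~ID, ~Col1,  ~Col2,
              'BAI.txt',  'T',   1, 'sdaf', 'eiri',
              'BAJ.txt',  'N',   2, 'fdd',  'fds',
              'BBK.txt',  'T',   1, 'ter',  'ase',
              'BCD.txt',  'N',   1, 'twe',  'ase')

# add_count() adds a column n holding the size of each TN + ID group;
# filter on it, then drop the helper column
dupes <- tr %>%
  add_count(TN, ID) %>%
  filter(n > 1) %>%
  select(-n)
```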
answered Nov 09 '22 by Robin


Here is an option using duplicated twice, the second time with fromLast = TRUE, because duplicated alone returns TRUE only from the second occurrence of a value onward:

dupe = data[,c('T.N','ID')] # select columns to check duplicates
data[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]

#     File T.N ID Col1 Col2
#1 BAI.txt   T  1 sdaf eiri
#3 BBK.txt   T  1  ter  ase
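The same duplicated-twice trick can be wrapped in a small reusable helper (a sketch; the function name find_dup_rows is not from the original answer), shown here with the example data built as a data frame:

```r
# Return all rows that are duplicated in the given key columns
find_dup_rows <- function(df, cols) {
  key <- df[, cols, drop = FALSE]
  # duplicated() marks 2nd, 3rd, ... occurrences; adding fromLast = TRUE
  # also marks the first occurrence of each duplicated key
  df[duplicated(key) | duplicated(key, fromLast = TRUE), ]
}

data <- data.frame(
  File = c('BAI.txt', 'BAJ.txt', 'BBK.txt', 'BCD.txt'),
  T.N  = c('T', 'N', 'T', 'N'),
  ID   = c(1, 2, 1, 1),
  Col1 = c('sdaf', 'fdd', 'ter', 'twe'),
  Col2 = c('eiri', 'fds', 'ase', 'ase'),
  stringsAsFactors = FALSE
)

dupes <- find_dup_rows(data, c('T.N', 'ID'))
```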
answered Nov 09 '22 by Veerendra Gadekar


A simple solution is find_duplicates from the hablar package:

library(dplyr)
library(data.table)
library(hablar)

df <- fread("
  File     T.N     ID     Col1     Col2
  BAI.txt   T      1       sdaf    eiri
  BAJ.txt   N      2       fdd     fds
  BBK.txt   T      1       ter     ase
  BCD.txt   N      1       twe     ase
            ")

df %>% 
  find_duplicates(T.N, ID)

which returns the rows with duplicates in T.N and ID:

  File    T.N      ID Col1  Col2 
  <chr>   <chr> <int> <chr> <chr>
1 BAI.txt T         1 sdaf  eiri 
2 BBK.txt T         1 ter   ase 
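Since the question also allows data.table, here is a sketch of the same two-column filter in that package (not from the original answers); note that the by columns are moved to the front of the result:

```r
library(data.table)

dt <- data.table(File = c('BAI.txt', 'BAJ.txt', 'BBK.txt', 'BCD.txt'),
                 T.N  = c('T', 'N', 'T', 'N'),
                 ID   = c(1L, 2L, 1L, 1L),
                 Col1 = c('sdaf', 'fdd', 'ter', 'twe'),
                 Col2 = c('eiri', 'fds', 'ase', 'ase'))

# keep only the T.N + ID groups that occur more than once;
# .SD holds the non-grouping columns of each group
res <- dt[, if (.N > 1) .SD, by = .(T.N, ID)]
```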
answered Nov 09 '22 by davsjob