
How to identify only "not duplicated" rows

Tags: r, data.table

I have a situation like this: several data.tables combined with rbind.

library(data.table)
x <-  data.table(id=c(1,2,3,4),dsp=c(5,6,7,8),status=c(FALSE,TRUE,FALSE,TRUE))
y <-  data.table(id=c(1,2,3,4),dsp=c(6,6,7,8),status=c(FALSE,FALSE,FALSE,TRUE))
z <- data.table(id=c(1,2,3,4),dsp=c(5,6,9,8),status=c(FALSE,TRUE,FALSE,FALSE))
w <- data.table(id=c(1,2,3,4),dsp=c(5,6,7,NA),status=c(FALSE,TRUE,FALSE,TRUE))
setkey(x,id)
setkey(y,id)
setkey(z,id)
setkey(w,id)
Bigdt<-rbind(x,y,z,w)

I would like to obtain ONLY the rows that are not repeated, i.e. the rows that occur exactly once:

id  dsp status
1   6   FALSE
2   6   FALSE
3   9   FALSE
4   8   FALSE
4   NA  TRUE

So I tried

Resultdt<-Bigdt[!duplicated(Bigdt)]

but the result:

id  dsp status
1   5   FALSE
2   6   TRUE
3   7   FALSE
4   8   TRUE

does not match my expectations. I tried different methods (rbind is not mandatory), for example merge and joins. The data.table package seems to be the one most likely to contain the solution. Any ideas?

Antonello Salis asked May 27 '16 14:05




2 Answers

You can do

Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL][]

   id dsp status
1:  1   6  FALSE
2:  2   6  FALSE
3:  3   9  FALSE
4:  4   8  FALSE
5:  4  NA   TRUE

To see how it works, run just part of the DT[][][][] chain:

  • Bigdt[, .N, by=names(Bigdt)]
  • Bigdt[, .N, by=names(Bigdt)][N == 1L]
  • Bigdt[, .N, by=names(Bigdt)][N == 1L][, N := NULL]
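The stages of that chain can be sketched on a smaller table. This is only an illustration: the table `dt` and the names `counts` and `once` are made up for the example and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy table: the row (1, "a") appears twice, the others once.
dt <- data.table(id = c(1, 1, 2, 3), val = c("a", "a", "b", "c"))

# Stage 1: count how often each distinct combination of all columns occurs.
counts <- dt[, .N, by = names(dt)]

# Stage 2: keep only the combinations that occur exactly once.
once <- counts[N == 1L]

# Stage 3: drop the helper column N by reference.
once[, N := NULL]

print(counts)
print(once)
```

Grouping by every column collapses identical rows into one group, so `N` is the number of copies of that row, and filtering on `N == 1L` leaves the rows that were never repeated.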
Frank answered Sep 27 '22 16:09


You may also try

Bigdt[!(duplicated(Bigdt)|duplicated(Bigdt, fromLast=TRUE))]
#   id dsp status
#1:  1   6  FALSE
#2:  2   6  FALSE
#3:  3   9  FALSE
#4:  4   8  FALSE
#5:  4  NA   TRUE
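A small sketch of why both duplicated() calls are combined, using a made-up table `dt` (not the question's data). Passing `by = names(dt)` explicitly is an addition for this example, so that all columns are compared regardless of any key or of the default in the installed data.table version.

```r
library(data.table)

# Hypothetical toy table: the row (1, "a") appears twice, the others once.
dt <- data.table(id = c(1, 1, 2, 3), val = c("a", "a", "b", "c"))

# Forward scan: flags the second and later copies of a row, not the first.
fwd <- duplicated(dt, by = names(dt))

# Backward scan: flags every copy except the last one.
bwd <- duplicated(dt, by = names(dt), fromLast = TRUE)

# A row belongs to a repeated group if either scan flags it,
# so negating the union keeps only rows that occur exactly once.
unique_only <- dt[!(fwd | bwd)]

print(fwd)
print(bwd)
print(unique_only)
```

Either scan alone always leaves one copy of a repeated row unflagged (the first or the last), which is why a single `!duplicated()` still returns one representative of each repeated row.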

Or, if we are using .SD:

Bigdt[Bigdt[,!(duplicated(.SD)|duplicated(.SD, fromLast=TRUE))]]

Another option is to group by all the column names, find the row indices of the single-row groups with .I, and subset the dataset with them:

Bigdt[Bigdt[, .I[.N==1], by = names(Bigdt)]$V1]
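To see what the inner call returns before it is used for subsetting, here is a minimal sketch on a made-up table (`dt`, `idx` and `rows` are illustrative names only):

```r
library(data.table)

# Hypothetical toy table: the row (1, "a") appears twice, the others once.
dt <- data.table(id = c(1, 1, 2, 3), val = c("a", "a", "b", "c"))

# Within each group, .I holds the original row numbers and .N the group size.
# .I[.N == 1] returns the row number for single-row groups and nothing otherwise.
idx <- dt[, .I[.N == 1], by = names(dt)]

# The unnamed j expression ends up in column V1; use it as a row index.
rows <- idx$V1
result <- dt[rows]

print(idx)
print(result)
```

Because the subset is taken by row number from the original table, any columns that were not part of the grouping would be carried along unchanged.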
akrun answered Sep 27 '22 18:09