 

find *all* duplicated records in data.table (not all-but-one)

Tags:

r

data.table

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't mark the first occurrence of each duplicated record. What is the best way to mark this first occurrence as well? In the case of base::duplicated(), I solved this simply by a disjunction with the reversed-order version: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE)). But in data.table::duplicated(), fromLast=TRUE is not included (I don't know why)...

P.S. OK, here's a primitive example:

myDT <- fread(
"id,fB,fC
1, b1,c1
2, b2,c2
3, b1,c1
4, b3,c3
5, b1,c1")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD := duplicated(myDT)]

Rows 1, 3 and 5 are all duplicates, but only rows 3 and 5 will be marked by duplicated(), while I need to mark all of them.

UPD. Important notice: the answer I've accepted below works only for a keyed table. If you want to find duplicate records considering all columns, you have to setkey on all those columns explicitly. So far I use the following workaround specifically for this case:

dups1 <- duplicated(myDT)
dups2 <- duplicated(myDT, fromLast=TRUE)
dups <- dups1 | dups2
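A minimal self-contained sketch of that workaround for the all-columns case, assuming a data.table version recent enough that duplicated() accepts the by and fromLast arguments (the id column is excluded from the comparison, since it is unique per row):

```r
library(data.table)

myDT <- fread(
"id,fB,fC
1,b1,c1
2,b2,c2
3,b1,c1
4,b3,c3
5,b1,c1")

# Every column except the row identifier counts towards "duplicate"
cols <- setdiff(names(myDT), "id")

# Mark all members of each duplicated group, including the first occurrence,
# without needing to setkey the table
myDT[, fD := duplicated(myDT, by = cols) |
             duplicated(myDT, fromLast = TRUE, by = cols)]
```

With the sample data, rows 1, 3 and 5 (the b1/c1 group) get fD = TRUE and the rest get FALSE.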
Vasily A asked Oct 15 '13 at 22:10


1 Answer

This appears to work:

> myDT[unique(myDT), fD := .N > 1]
> myDT
   id fB fC    fD
1:  1 b1 c1  TRUE
2:  3 b1 c1  TRUE
3:  5 b1 c1  TRUE
4:  2 b2 c2 FALSE
5:  4 b3 c3 FALSE

Thanks to @flodel, a better way to do it is this:

> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
   id fB fC    fD
1:  1 b1 c1  TRUE
2:  3 b1 c1  TRUE
3:  5 b1 c1  TRUE
4:  2 b2 c2 FALSE
5:  4 b3 c3 FALSE

The difference in efficiency is substantial:

> microbenchmark(
    key    = myDT[, fD := .N > 1, by = key(myDT)],
    unique = myDT[unique(myDT), fD := .N > 1])
Unit: microseconds
   expr      min       lq    median        uq       max neval
    key  679.874  715.700  735.0575  773.7595  1825.437   100
 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100

Especially for the max. What's going on there?

mrip answered Oct 06 '22 at 09:10