if I understand correctly, duplicated()
function for data.table
returns a logical vector which doesn't contain first occurrence of duplicated record. What is the best way to mark this first occurrence as well? In case of base::duplicated()
, I solved this simply by disjunction with reversed order function: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE))
- but in data.table::duplicated()
, fromLast=TRUE
is not included (I don't know why)...
P.S. ok, here's a primitive example
myDT <- fread( "id,fB,fC 1, b1,c1 2, b2,c2 3, b1,c1 4, b3,c3 5, b1,c1 ") setkeyv(myDT, c('fB', 'fC')) myDT[, fD:=duplicated(myDT)]
rows 1, 3 and 5 are all duplicates but only 3 and 5 will be included in duplicated
while I need to mark all of them.
UPD. important notice: the answer I've accepted below works only for keyed table. If you want to find duplicate records considering all columns, you have to setkey
all these columns explicitly. So far I use the following workaround specifically for this case:
dups1 <- duplicated(myDT); dups2 <- duplicated(myDT, fromLast=T); dups <- dups1 | dups2;
One way to find duplicate records from the table is the GROUP BY statement. The GROUP BY statement in SQL is used to arrange identical data into groups with the help of some functions. i.e if a particular column has the same values in different rows then it will arrange these rows in a group.
Once you have grouped data you can filter out duplicates by using having clause. Having clause is the counterpart of where clause for aggregation queries. Just remember to provide a temporary name to count() data in order to use them in having clause.
This appears to work:
> myDT[unique(myDT),fD:=.N>1] > myDT id fB fC fD 1: 1 b1 c1 TRUE 2: 3 b1 c1 TRUE 3: 5 b1 c1 TRUE 4: 2 b2 c2 FALSE 5: 4 b3 c3 FALSE
Thanks to @flodel, the better way to do it is this:
> myDT[, fD := .N > 1, by = key(myDT)] > myDT id fB fC fD 1: 1 b1 c1 TRUE 2: 3 b1 c1 TRUE 3: 5 b1 c1 TRUE 4: 2 b2 c2 FALSE 5: 4 b3 c3 FALSE
The difference in efficiency is substantial:
> microbenchmark( key=myDT[, fD := .N > 1, by = key(myDT)], unique=myDT[unique(myDT),fD:=.N>1]) Unit: microseconds expr min lq median uq max neval key 679.874 715.700 735.0575 773.7595 1825.437 100 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645 100
Especially for the max. What's going on there?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With