 

find *all* duplicated records in data.table (not all-but-one)

Tags:

r

data.table

If I understand correctly, the duplicated() function for data.table returns a logical vector which doesn't mark the first occurrence of each duplicated record. What is the best way to mark this first occurrence as well? In the case of base::duplicated(), I solved this simply by a disjunction with the reversed-order version: myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE)). But in data.table::duplicated(), fromLast=TRUE is not included (I don't know why)...

P.S. OK, here's a primitive example:

myDT <- fread(
"id,fB,fC
1, b1,c1
2, b2,c2
3, b1,c1
4, b3,c3
5, b1,c1")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD := duplicated(myDT)]

Rows 1, 3 and 5 are all duplicates, but only rows 3 and 5 will be marked by duplicated(), while I need to mark all of them.

UPD. Important notice: the answer I've accepted below works only for a keyed table. If you want to find duplicate records considering all columns, you have to setkey on all those columns explicitly. So far I use the following workaround specifically for this case:

dups1 <- duplicated(myDT)
dups2 <- duplicated(myDT, fromLast=TRUE)
dups <- dups1 | dups2
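A minimal self-contained sketch of that workaround for the all-columns case, assuming a data.table version recent enough that duplicated() accepts the by and fromLast arguments (the id column is excluded from the comparison, since it is unique per row):

```r
library(data.table)

myDT <- fread(
"id,fB,fC
1,b1,c1
2,b2,c2
3,b1,c1
4,b3,c3
5,b1,c1")

# Every column except the row identifier counts towards "duplicate"
cols <- setdiff(names(myDT), "id")

# Mark all members of each duplicated group, including the first occurrence,
# without needing to setkey the table
myDT[, fD := duplicated(myDT, by = cols) |
             duplicated(myDT, fromLast = TRUE, by = cols)]
```

With the sample data, rows 1, 3 and 5 (the b1/c1 group) get fD = TRUE and the rest get FALSE.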
Vasily A asked Oct 15 '13 at 22:10


1 Answer

This appears to work:

> myDT[unique(myDT), fD := .N > 1]
> myDT
   id fB fC    fD
1:  1 b1 c1  TRUE
2:  3 b1 c1  TRUE
3:  5 b1 c1  TRUE
4:  2 b2 c2 FALSE
5:  4 b3 c3 FALSE

Thanks to @flodel, a better way to do it is this:

> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
   id fB fC    fD
1:  1 b1 c1  TRUE
2:  3 b1 c1  TRUE
3:  5 b1 c1  TRUE
4:  2 b2 c2 FALSE
5:  4 b3 c3 FALSE

The difference in efficiency is substantial:

> microbenchmark(
    key    = myDT[, fD := .N > 1, by = key(myDT)],
    unique = myDT[unique(myDT), fD := .N > 1])
Unit: microseconds
   expr      min       lq    median        uq       max neval
    key  679.874  715.700  735.0575  773.7595  1825.437   100
 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100

Especially for the max. What's going on there?

mrip answered Oct 06 '22 at 09:10