Fastest way to remove all duplicates in R

Tags: performance, r, duplicates, unique

I'd like to remove all items that appear more than once in a vector, so that only values occurring exactly once remain. This should work for character, numeric and integer vectors. Currently, I'm using duplicated() both forward and backward (using the fromLast parameter).

Is there a more computationally efficient (faster) way to execute this in R? The solution below is simple enough to write/read, but it seems inefficient to execute the duplicate search twice. Perhaps a counting-based method using an additional data structure would be better?

Example:

d <- c(1,2,3,4,1,5,6,4,2,1)
d[!(duplicated(d) | duplicated(d, fromLast=TRUE))]
#[1] 3 5 6
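
The counting-based idea mentioned above could look roughly like the following. This is only a sketch: `count_singles` is a made-up helper name, and nothing here says it is faster than the `duplicated()` version. It counts how often each distinct value occurs with `tabulate(match(...))` and keeps the values seen exactly once, which also works for character vectors.

```r
# Counting-based sketch: count occurrences of each distinct value,
# then keep only the values that were seen exactly once.
count_singles <- function(x) {      # hypothetical helper name
  ux <- unique(x)                   # distinct values, in order of first appearance
  counts <- tabulate(match(x, ux), nbins = length(ux))
  ux[counts == 1L]
}

count_singles(c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))
#[1] 3 5 6
count_singles(c("a", "b", "a", "c"))
#[1] "b" "c"
```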

Related SO posts here and here.

465

asked May 10 '16 20:05

Megatron

1 Answer

Some timings:

set.seed(1001)
d <- sample(1:100000, 100000, replace=TRUE)
d <- c(d, sample(d, 20000, replace=TRUE))  # ensure many duplicates
mb <- microbenchmark::microbenchmark(
  d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
  setdiff(d, d[duplicated(d)]),
  {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  as.integer(names(table(d)[table(d)==1])),
  d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
  d[!(d %in% d[duplicated(d)])],
  { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
  d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
summary(mb)[, c(1, 4)]  # in milliseconds
#                                                                                expr      mean
#1                               d[!(duplicated(d) | duplicated(d, fromLast = TRUE))]  18.34692
#2                                                       setdiff(d, d[duplicated(d)])  24.84984
#3                       {     tmp <- rle(sort(d))     tmp$values[tmp$lengths == 1] }   9.53831
#4                                         as.integer(names(table(d)[table(d) == 1])) 255.76300
#5               d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))]  18.35360
#6                                                      d[!(d %in% d[duplicated(d)])]  24.01009
#7                        {     ud = unique(d)     ud[tabulate(match(d, ud)) == 1L] }  32.10166
#8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d,      F, T, NA)))]  18.33475
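
One caveat when reading these timings: the variants do not return the same vector in general. The small sketch below uses a made-up input and two hypothetical helper names (`singles_keep_order`, `singles_sorted`). It shows that the `rle(sort(d))` variant returns its result sorted and, because `sort()` drops `NA` by default, without any `NA`, while the `duplicated()` variant keeps the original order.

```r
# Hypothetical helper names, not part of base R.
singles_keep_order <- function(x) {
  # keeps the original order of the surviving elements
  x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]
}

singles_sorted <- function(x) {
  runs <- rle(sort(x))          # sort() drops NA by default
  runs$values[runs$lengths == 1]
}

x <- c(6, 3, 1, NA, 5, 1)
singles_keep_order(x)
#[1]  6  3 NA  5
singles_sorted(x)
#[1] 3 5 6
```

If the original order or `NA` handling matters, the fastest row in the table may therefore not be a drop-in replacement.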

Given the comments, let's check that all of the approaches return the correct result:

# d must be the small example vector from the question for the expected value c(3, 5, 6)
d <- c(1,2,3,4,1,5,6,4,2,1)
results <- list(d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
        setdiff(d, d[duplicated(d)]),
        {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
        as.integer(names(table(d)[table(d)==1])),
        d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
        d[!(d %in% d[duplicated(d)])],
        { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
        d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))])
all(sapply(results, all.equal, c(3, 5, 6)))
#[1] TRUE
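
The check above compares against c(3, 5, 6), so it only covers the ten-element example. A possible way to repeat it on the larger benchmark vector is sketched below, assuming it is enough to compare the results as sorted numeric vectors, since the variants differ in order and type. The helper `same_set` and the list names are made up for this illustration.

```r
# Rebuild the benchmark vector from the timings section.
set.seed(1001)
d <- sample(1:100000, 100000, replace = TRUE)
d <- c(d, sample(d, 20000, replace = TRUE))

# Use the forward/backward duplicated() result as the reference.
reference <- d[!(duplicated(d) | duplicated(d, fromLast = TRUE))]

candidates <- list(
  setdiff  = setdiff(d, d[duplicated(d)]),
  rle_sort = {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  table    = {tab <- table(d); as.integer(names(tab)[tab == 1])},
  tabulate = {ud <- unique(d); ud[tabulate(match(d, ud)) == 1L]}
)

# Order-insensitive comparison (hypothetical helper).
same_set <- function(a, b) identical(sort(as.numeric(a)), sort(as.numeric(b)))
sapply(candidates, same_set, b = reference)
```

Every entry of the returned logical vector should be TRUE if the variants agree on which values occur exactly once.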

165

answered Oct 07 '22 00:10

Raad
