What is the fastest way to get a vector of sorted unique values from a data.table?

The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:

# 1
sort(salesdt[, unique(company)])
# 2
sort(unique(salesdt$company))
# 3
salesdt[order(company), unique(company)]

Another answer suggested other sort options than lexicographical order:

salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]

The data.table was created by

library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company, item, sales)

As always, when several options are available to choose from, I started to wonder which solution would be best, in particular if the data.table were much larger. I have searched a bit on SO but haven't found a definitive answer so far.

asked Apr 30 '16 by Uwe




2 Answers

For benchmarking, a larger data.table with 1,000,000 rows is created:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))
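
As a quick sanity check, uniqueN() (data.table's fast distinct count) confirms that the sampled data still contains only the five company names from the question, which matters for the caveat at the end of this answer:

uniqueN(salesdt$company)
# [1] 5   (all five letters A, L, S, T, W are sampled)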

For the sake of completeness, the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

will also be benchmarked, although it seems obvious that extracting the handful of unique values first and then sorting them should beat sorting the full vector of one million elements before deduplicating.

In addition, two other sort options from this answer are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]

Edit: Following Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company
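
This works (my reading of the idiom, not Frank's wording) because keyby = company sorts the result by the grouping column, returning one row per group in sorted order; logical(1) is just a cheap dummy value evaluated per group, and $company extracts the sorted unique names:

# one row per company, already sorted thanks to keyby;
# logical(1) is a minimal throwaway value computed for each group
salesdt[, logical(1), keyby = company]$company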

Benchmarking, no key set

Benchmarking is done with the help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[,sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)
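
If ggplot2 is not available, the timings can also be inspected as a table; summary() on a microbenchmark object prints the quantiles per expression:

summary(timings)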

Please note the reversed order in the chart (#1 at the bottom, #8 at the top).

[Benchmark plot: autoplot of the timings for variants #1 to #8, no key set]

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.

Variant #3 was a bit of a surprise to me. Despite data.table's fast radix sort, it is less efficient than #1 and #2: it seems to sort the full column first and only then extract the unique values.

Benchmarking, data.table keyed by company

Motivated by this observation I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")
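
Keying physically reorders the rows by company, so the column is already sorted before any of the benchmarked expressions run. This can be verified with key(), data.table's accessor for the current key:

key(salesdt)
# [1] "company"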

The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.

[Benchmark plot: autoplot of the timings, data.table keyed by company]

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company), which isn't necessary if the table is already keyed by company. So I removed the now unnecessary calls to order() and sort() from #3, #4, and #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.

[Benchmark plot: autoplot of the timings, keyed and with the tuned variants]

Caveat: The benchmarking is based on the original data which only includes 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results above were obtained with data.table v1.9.7.
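
To probe that caveat, the benchmark could be rerun with many more distinct names; a minimal sketch (the generated names are made up for illustration):

# 10,000 hypothetical distinct company names instead of 5
many_companies <- sprintf("Company%05d", seq_len(1e4))
salesdt <- data.table(company = sample(many_companies, n, TRUE),
                      item = sample(item, n, TRUE),
                      sales = sample(sales, n, TRUE))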

answered Nov 15 '22 by Uwe


Alternatively you could do the following:

library(data.table)
n <- 1e6
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

ptm <- proc.time() 
sort(salesdt[, unique(company)])
proc.time() - ptm

ptm <- proc.time() 
sort(unique(salesdt$company))
proc.time() - ptm

ptm <- proc.time() 
salesdt[order(company), unique(company)]
proc.time() - ptm

The information provided by proc.time() is not as thorough as microbenchmark's, but it is simpler to use.

Output for the above is:

sort(salesdt[, unique(company)])
user  system elapsed 
0.05    0.02    0.06 

sort(unique(salesdt$company))
user  system elapsed 
0.01    0.01    0.03 

salesdt[order(company), unique(company)]
user  system elapsed 
0.03    0.02    0.05 

Here user time is the CPU time spent executing the R code itself, system time is CPU time spent by the operating system on behalf of the process, and elapsed time is the wall-clock time since the stopwatch was started (roughly the sum of user and system time if the code runs uninterrupted). (Adapted from http://www.ats.ucla.edu/stat/r/faq/timing_code.htm)
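
Base R's system.time() wraps exactly this proc.time() bookkeeping, so each of the measurements above can be written as a one-liner:

# runs the expression and returns user/system/elapsed times
system.time(sort(unique(salesdt$company)))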

answered Nov 15 '22 by Krug