What is the fastest way to get a vector of sorted unique values from a data.table?

The answer to this question (Unique sorted rows single column from R data.table) suggested three different ways to get a vector of sorted unique values from a data.table:

# 1
sort(salesdt[, unique(company)])
# 2
sort(unique(salesdt$company))
# 3
salesdt[order(company), unique(company)]

Another answer suggested other sort options than lexicographical order:

salesdt[, .N, by = company][order(-N), company]
salesdt[, sum(sales), by = company][order(-V1), company]

The data.table was created by

library(data.table)
company <- c("A", "S", "W", "L", "T", "T", "W", "A", "T", "W")
item <- c("Thingy", "Thingy", "Widget", "Thingy", "Grommit", 
          "Thingy", "Grommit", "Thingy", "Widget", "Thingy")
sales <- c(120, 140, 160, 180, 200, 120, 140, 160, 180, 200)
salesdt <- data.table(company, item, sales)

As always, when several options are available to choose from, I started to wonder which solution would be best, in particular if the data.table were much larger. I have searched a bit on SO but haven't found a definitive answer so far.

asked Apr 30 '16 by Uwe




2 Answers

For benchmarking, a larger data.table with 1,000,000 rows is created:

n <- 1e6
set.seed(1234) # to reproduce the data
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))
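
As a quick sanity check, uniqueN() (data.table's fast distinct count) confirms that the sampled data still contains only the five company names from the question, which matters for the caveat at the end of this answer:

uniqueN(salesdt$company)
# [1] 5   (all five letters A, L, S, T, W are sampled)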

For the sake of completeness, the variants

# 4
unique(sort(salesdt$company))
# 5
unique(salesdt[,sort(company)])

will also be benchmarked, although it seems obvious that extracting the handful of unique values first and then sorting them should beat sorting the full vector of one million elements before deduplicating.

In addition, two other sort options from this answer are included:

# 6
salesdt[, .N, by = company][order(-N), company]
# 7
salesdt[, sum(sales), by = company][order(-V1), company]

Edit: Following Frank's comment, I've included his suggestion:

# 8
salesdt[,logical(1), keyby = company]$company
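
This works (my reading of the idiom, not Frank's wording) because keyby = company sorts the result by the grouping column, returning one row per group in sorted order; logical(1) is just a cheap dummy value evaluated per group, and $company extracts the sorted unique names:

# one row per company, already sorted thanks to keyby;
# logical(1) is a minimal throwaway value computed for each group
salesdt[, logical(1), keyby = company]$company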

Benchmarking, no key set

Benchmarking is done with the help of the microbenchmark package:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[order(company), unique(company)],
  unique(sort(salesdt$company)),
  unique(salesdt[,sort(company)]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings are displayed with

ggplot2::autoplot(timings)
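
If ggplot2 is not available, the timings can also be inspected as a table; summary() on a microbenchmark object prints the quantiles per expression:

summary(timings)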

Please note the reversed order in the chart (#1 at the bottom, #8 at the top).

[Benchmark plot: autoplot of the timings for variants #1 to #8, no key set]

As expected, variants #4 and #5 (unique after sort) are pretty slow. Edit: #8 is the fastest, which confirms Frank's comment.

Variant #3 was a bit of a surprise to me. Despite data.table's fast radix sort, it is less efficient than #1 and #2: it seems to sort the full column first and only then extract the unique values.

Benchmarking, data.table keyed by company

Motivated by this observation I repeated the benchmark with the data.table keyed by company.

setkeyv(salesdt, "company")
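
Keying physically reorders the rows by company, so the column is already sorted before any of the benchmarked expressions run. This can be verified with key(), data.table's accessor for the current key:

key(salesdt)
# [1] "company"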

The timings show (please note the change in scale of the time axis) that #4 and #5 have been accelerated dramatically by keying. They are even faster than #3. Note that timings for variant #8 are included in the next section.

[Benchmark plot: autoplot of the timings, data.table keyed by company]

Benchmarking, keyed with a bit of tuning

Variant #3 still includes order(company), which isn't necessary if the table is already keyed by company. So I removed the now unnecessary calls to order() and sort() from #3, #4, and #5:

timings <- microbenchmark::microbenchmark(
  sort(salesdt[, unique(company)]),
  sort(unique(salesdt$company)),
  salesdt[, unique(company)],
  unique(salesdt$company),
  unique(salesdt[, company]),
  salesdt[, .N, by = company][order(-N), company],
  salesdt[, sum(sales), by = company][order(-V1), company],
  salesdt[,logical(1), keyby = company]$company
)

The timings now show variants #1 to #4 on the same level. Edit: Again, #8 (Frank's solution) is the fastest.

[Benchmark plot: autoplot of the timings, keyed and with the tuned variants]

Caveat: The benchmarking is based on the original data which only includes 5 different letters as company names. It is likely that the results will look different with a larger number of distinct company names. The results above were obtained with data.table v1.9.7.
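
To probe that caveat, the benchmark could be rerun with many more distinct names; a minimal sketch (the generated names are made up for illustration):

# 10,000 hypothetical distinct company names instead of 5
many_companies <- sprintf("Company%05d", seq_len(1e4))
salesdt <- data.table(company = sample(many_companies, n, TRUE),
                      item = sample(item, n, TRUE),
                      sales = sample(sales, n, TRUE))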

answered Nov 15 '22 by Uwe


Alternatively you could do the following:

library(data.table)
n <- 1e6
salesdt <- data.table(company = sample(company, n, TRUE), 
                      item = sample(item, n, TRUE), 
                      sales = sample(sales, n, TRUE))

ptm <- proc.time() 
sort(salesdt[, unique(company)])
proc.time() - ptm

ptm <- proc.time() 
sort(unique(salesdt$company))
proc.time() - ptm

ptm <- proc.time() 
salesdt[order(company), unique(company)]
proc.time() - ptm

The information provided by proc.time() is not as thorough as microbenchmark's, but it is simpler to use.

Output for the above is:

sort(salesdt[, unique(company)])
user  system elapsed 
0.05    0.02    0.06 

sort(unique(salesdt$company))
user  system elapsed 
0.01    0.01    0.03 

salesdt[order(company), unique(company)]
user  system elapsed 
0.03    0.02    0.05 

Here user time is the CPU time spent executing the R code itself, system time is CPU time spent by the operating system on behalf of the process, and elapsed time is the wall-clock time since the stopwatch was started (roughly the sum of user and system time if the code runs uninterrupted). (Adapted from http://www.ats.ucla.edu/stat/r/faq/timing_code.htm)
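
Base R's system.time() wraps exactly this proc.time() bookkeeping, so each of the measurements above can be written as a one-liner:

# runs the expression and returns user/system/elapsed times
system.time(sort(unique(salesdt$company)))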

answered Nov 15 '22 by Krug