Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest by column sort in R

Tags:

r

data.table

I have a data frame full from which I want to take the last column and a column v. I then want to sort both columns on v in the fastest way possible. full is read in from a csv but this can be used for testing (included some NAs for realism):

n <- 200000
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 10000), 'A'] <- NA
v <- 1

I have v as one here, but in reality it could change, and full has many columns.


I have tried sorting data frames, data tables and matrices each with order and sort.list (some ideas taken from this thread). The code for all these:

# DATA FRAME

ord_df <- function() {
  a <- full[c(v, length(full))]
  a[with(a, order(a[1])), ]
}

sl_df <- function() {
  a <- full[c(v, length(full))]
  a[sort.list(a[[1]]), ] 
}


# DATA TABLE

require(data.table)

ord_dt <- function() {
  a <- as.data.table(full[c(v, length(full))])
  colnames(a)[1] <- 'values'
  a[order(values)]
}

sl_dt <- function() {
 a <- as.data.table(full[c(v, length(full))])
 colnames(a)[1] <- 'values'
 a[sort.list(values)]
}


# MATRIX

ord_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[order(a[, 1]), ] 
}

sl_mat <- function() {
  a <- as.matrix(full[c(v, length(full))])
  a[sort.list(a[, 1]), ] 
}

Time results:

         ord_df  sl_df    ord_dt   sl_dt   ord_mat sl_mat
Min.     0.230   0.1500   0.1300   0.120   0.140   0.1400
Median   0.250   0.1600   0.1400   0.140   0.140   0.1400
Mean     0.244   0.1610   0.1430   0.136   0.142   0.1450
Max.     0.250   0.1700   0.1600   0.140   0.160   0.1600

Or using microbenchmark (results are in milliseconds):

             min      lq       median   uq       max
1  ord_df() 243.0647 248.2768 254.0544 265.2589 352.3984
2  ord_dt() 133.8159 140.0111 143.8202 148.4957 181.2647
3 ord_mat() 140.5198 146.8131 149.9876 154.6649 191.6897
4   sl_df() 152.6985 161.5591 166.5147 171.2891 194.7155
5   sl_dt() 132.1414 139.7655 144.1281 149.6844 188.8592
6  sl_mat() 139.2420 146.8578 151.6760 156.6174 186.5416

Seems like ordering the data table wins. There isn't all that much difference between order and sort.list except when using data frames where sort.list is much faster.

In the data table versions I also tried setting v as the key (since it is then sorted according to the documentation) but I couldn't get it work since the contents of v are not integer.

I would ideally like to speed this up as much as possible since I have to do it many times for different v values. Does anyone know how I might be able to speed this process up even further? Also might it be worth trying an Rcpp implementation? Thanks.


Here's the code I used for timing if it's useful to anyone:

sortMethods <- list(ord_df, sl_df, ord_dt, sl_dt, ord_mat, sl_mat)

require(plyr)
timings <- raply(10, sapply(sortMethods, function(x) system.time(x())[[3]]))
colnames(timings) <- c('ord_df', 'sl_df', 'ord_dt', 'sl_dt', 'ord_mat', 'sl_mat')
apply(timings, 2, summary) 

require(microbenchmark)
mb <- microbenchmark(ord_df(), sl_df(), ord_dt(), sl_dt(), ord_mat(), sl_mat())
plot(mb)
like image 789
Fist Avatar asked Jul 19 '12 10:07

Fist


1 Answers

I don't know if it's better to put this sort of thing in as an edit but it seems more like answer so here will do. Updated test functions:

n <- 1e7
full <- data.frame(A = runif(n, 1, 10000), B = floor(runif(n, 0, 1.9)))
full[sample(n, 100000), 'A'] <- NA

fdf <- full
fma <- as.matrix(full)
fdt <- as.data.table(full)
setnames(fdt, colnames(fdt)[1], 'values')

# DATA FRAME
ord_df <- function() { fdf[order(fdf[1]), ] }
sl_df <- function() { fdf[sort.list(fdf[[1]]), ] }

# DATA TABLE
require(data.table)
ord_dt <- function() { fdt[order(values)] }

key_dt <- function() {
  setkey(fdt, values) 
  fdt
}

# MATRIX
ord_mat <- function() { fma[order(fma[, 1]), ] }
sl_mat <- function() { fma[sort.list(fma[, 1]), ] }

Results (using a different computer, R 2.13.1 and data.table 1.8.2):

         ord_df  sl_df   ord_dt  key_dt  ord_mat sl_mat
Min.     37.56   20.86   2.946   2.249   20.22   20.21
1st Qu.  37.73   21.15   2.962   2.255   20.54   20.59
Median   38.43   21.74   3.002   2.280   21.05   20.82
Mean     38.76   21.75   3.074   2.395   21.09   20.95
3rd Qu.  39.85   22.18   3.151   2.445   21.48   21.42
Max.     40.36   23.08   3.330   2.797   22.41   21.84

Sorting

So data.table is the clear winner. Using a key is faster than ordering, and has a nicer syntax as well I'd argue. Thanks for the help everyone.

like image 125
Fist Avatar answered Jan 03 '23 00:01

Fist