Faster way to summarise variables based on column

I wish to summarise some variables in my data frame based on a column. However, my data frame is rather large (>30,000,000 rows) and the summarise function in dplyr takes ages to run. Is there a faster way in R to speed up the summarising?

I have a data frame df in the format:

  proid  X1  X2  X3 X4
1     1  zz   a   e  a
2     2  ff   g   z  b
3     1  cd   s   f  d 
4     3  ab   t   e  e
5     2  ta   b   h  k
      ....

I wish to concatenate the variables X1 to X4 for rows that share the same proid value, with the concatenated strings separated by commas. So the above table should give me this new table:

  proid     X1   X2   X3  X4
1     1  zz,cd  a,s  e,f a,d 
2     2  ff,ta  g,b  z,h b,k
3     3     ab    t    e   e
      ....
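For reference, base R's aggregate produces the same result, though it is far too slow at this scale; a minimal sketch of the desired behaviour, for illustration only:

# Base R equivalent of the desired output (much too slow for 30M rows)
aggregate(df[c("X1", "X2", "X3", "X4")], by = df["proid"],
          FUN = function(x) paste(na.omit(x), collapse = ","))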

I have used the following dplyr code:

concat <- function(x){
  x <- na.omit(x)
  if(length(x)==0){
    return(as.character(NA))
  }else{
    return(paste(x,collapse=","))
  }
}

dg <- group_by(df, proid)
df <- summarise(dg, X1 = concat(X1), X2 = concat(X2),
                X3 = concat(X3), X4 = concat(X4))
asked May 28 '15 by user22119
2 Answers

Edit note: I removed the original part of my answer that didn't address NA treatment and added a benchmark.

concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
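To see the NA treatment in action, a couple of illustrative calls on small vectors:

concat2(c("a", NA, "b"))  # "a,b"
concat2(c(NA, NA))        # NA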

Using data.table (note that X4 is excluded here via .SDcols = -c("X4"); drop that argument to concatenate all four columns):

setDT(df)[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
#   proid    X1  X2  X3
#1:     1 zz,cd a,s e,f
#2:     2 ff,ta g,b z,h
#3:     3    ab   t   e

Using dplyr:

df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
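As an aside, summarise_each() and funs() were deprecated in later dplyr releases; on dplyr >= 1.0 the equivalent would use across() (a sketch, assuming the same concat2 helper):

df %>% group_by(proid) %>% summarise(across(-X4, concat2))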

Benchmark: the data is smaller than in the actual use case and not fully representative, so this just gives an impression of how concat2 compares to concat and how data.table compares to dplyr.

library(microbenchmark)
library(dplyr)
library(data.table)

N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(sample(c(x, NA), N, TRUE)),
  X2 = sample(sample(c(x, NA), N, TRUE)),
  X3 = sample(sample(c(x, NA), N, TRUE)),
  X4 = sample(sample(c(x, NA), N, TRUE))
  )

dt <- as.data.table(df)

concat <- function(x){
  x <- na.omit(x)
  if(length(x)==0){
    return(as.character(NA))
  }else{
    return(paste(x,collapse=","))
  }
}

concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")

concat.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat), -X4)
}

concat2.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
}

concat.data.table <- function(){
  dt[, lapply(.SD, concat), by = proid, .SDcols = -c("X4")]
}

concat2.data.table <- function(){
  dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
}


microbenchmark(concat.dplyr(), 
               concat2.dplyr(), 
               concat.data.table(), 
               concat2.data.table(),
               unit = "relative",
               times = 10L)
Unit: relative
                 expr      min       lq   median       uq      max neval
       concat.dplyr() 1.058839 1.058342 1.083728 1.105907 1.080883    10
      concat2.dplyr() 1.057991 1.065566 1.109099 1.145657 1.079201    10
  concat.data.table() 1.024101 1.018443 1.093604 1.085254 1.066560    10
 concat2.data.table() 1.000000 1.000000 1.000000 1.000000 1.000000    10

Findings: data.table performs a little faster than dplyr on the sample data, and concat2 is a little faster than concat. However, the differences remain rather small on this sample data set.

answered by talat

na.omit does a ton of unnecessary checks and operations. Replacing it with a simple is.na() call will give you a decent speedup:

concat3 = function(x) {
  x = x[!is.na(x)]
  if (length(x) == 0)
    NA_character_
  else
    paste(x, collapse = ",")
}
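To make the overhead concrete: for vectors, na.omit() records which positions were dropped in a "na.action" attribute, while the plain subset skips that bookkeeping entirely (illustrative):

x <- c("a", NA, "b")
na.omit(x)       # returns c("a", "b") with a "na.action" attribute attached
x[!is.na(x)]     # returns c("a", "b") with no extra attributes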

Using docendo's data, but with strings instead of factors (factors slow all versions down).
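One way to rebuild that data with character columns, a sketch assuming R < 4.0 where data.frame() defaults to stringsAsFactors = TRUE:

N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(sample(c(x, NA), N, TRUE)),
  X2 = sample(sample(c(x, NA), N, TRUE)),
  X3 = sample(sample(c(x, NA), N, TRUE)),
  X4 = sample(sample(c(x, NA), N, TRUE)),
  stringsAsFactors = FALSE  # keep character columns instead of factors
)
dt <- as.data.table(df)

Then the benchmark: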

microbenchmark(dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")],
               dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")],
               times = 5)
#Unit: milliseconds
#                                                       expr       min       lq     mean   median       uq      max neval
# dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")]  960.2475 1079.177 1251.545 1342.684 1402.571 1473.045     5
# dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")] 1718.8121 1892.696 2159.148 2171.772 2470.205 2542.253     5
answered by eddi