I wish to summarise some variables in my data frame based on a column. However, my data frame is rather large (>30,000,000 rows), and using dplyr's summarise function takes ages to run. Is there a faster way to do this in R?
I have a data frame df in the following format:
  proid X1 X2 X3 X4
1     1 zz  a  e  a
2     2 ff  g  z  b
3     1 cd  s  f  d
4     3 ab  t  e  e
5     2 ta  b  h  k
....
I wish to concatenate the variables X1 to X4 for rows that have the same proid value, with the concatenated strings separated by commas. So the above table should give me the new table:
  proid    X1  X2  X3  X4
1     1 zz,cd a,s e,f a,d
2     2 ff,ta g,b z,h b,k
3     3    ab   t   e   e
....
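For reference, the example data above can be reproduced with something like the following (a sketch that assumes the X columns are plain character strings rather than factors):
# hypothetical reconstruction of the small example shown above
df <- data.frame(
  proid = c(1, 2, 1, 3, 2),
  X1 = c("zz", "ff", "cd", "ab", "ta"),
  X2 = c("a", "g", "s", "t", "b"),
  X3 = c("e", "z", "f", "e", "h"),
  X4 = c("a", "b", "d", "e", "k"),
  stringsAsFactors = FALSE
)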
I have used the following dplyr code:
concat <- function(x){
  x <- na.omit(x)
  if(length(x) == 0){
    return(as.character(NA))
  } else {
    return(paste(x, collapse = ","))
  }
}
dg <- group_by(df, proid)
df <- summarise(dg, X1 = concat(X1), X2 = concat(X2), X3 = concat(X3), X4 = concat(X4))
Edit note: I removed the original part of my answer that didn't address NA treatment and added a benchmark.
Here's a more compact helper that treats NA the same way as your concat:
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
Using data.table:
setDT(df)[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
#   proid    X1  X2  X3
#1:     1 zz,cd a,s e,f
#2:     2 ff,ta g,b z,h
#3:     3    NA   t   e
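Note that .SDcols = -c("X4") drops X4 here. If you want all of X1 to X4 concatenated, as in the question, you can simply omit that argument, since .SD then contains every non-grouping column:
# concatenate all non-grouping columns (X1 to X4) by proid
setDT(df)[, lapply(.SD, concat2), by = proid]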
Using dplyr:
df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
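As an aside, summarise_each()/funs() are deprecated in current dplyr; on dplyr >= 1.0 a roughly equivalent call (a sketch, not benchmarked here) uses across():
# dplyr >= 1.0 equivalent of summarise_each(funs(concat2), -X4)
df %>%
  group_by(proid) %>%
  summarise(across(-X4, concat2))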
Benchmark: smaller data than in the actual use case and not fully representative, so I just wanted to get an impression of how concat2 compares to concat, etc.
library(microbenchmark)
library(dplyr)
library(data.table)
N <- 1e6
x <- c(letters, LETTERS)
df <- data.frame(
  proid = sample(1e4, N, TRUE),
  X1 = sample(sample(c(x, NA), N, TRUE)),
  X2 = sample(sample(c(x, NA), N, TRUE)),
  X3 = sample(sample(c(x, NA), N, TRUE)),
  X4 = sample(sample(c(x, NA), N, TRUE))
)
dt <- as.data.table(df)
concat <- function(x){
  x <- na.omit(x)
  if(length(x) == 0){
    return(as.character(NA))
  } else {
    return(paste(x, collapse = ","))
  }
}
concat2 <- function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ",")
concat.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat), -X4)
}
concat2.dplyr <- function(){
  df %>% group_by(proid) %>% summarise_each(funs(concat2), -X4)
}
concat.data.table <- function(){
  dt[, lapply(.SD, concat), by = proid, .SDcols = -c("X4")]
}
concat2.data.table <- function(){
  dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")]
}
microbenchmark(concat.dplyr(),
               concat2.dplyr(),
               concat.data.table(),
               concat2.data.table(),
               unit = "relative",
               times = 10L)
Unit: relative
                 expr      min       lq   median       uq      max neval
       concat.dplyr() 1.058839 1.058342 1.083728 1.105907 1.080883    10
      concat2.dplyr() 1.057991 1.065566 1.109099 1.145657 1.079201    10
  concat.data.table() 1.024101 1.018443 1.093604 1.085254 1.066560    10
 concat2.data.table() 1.000000 1.000000 1.000000 1.000000 1.000000    10
Findings: data.table performs a little faster than dplyr on the sample data, and concat2 is a little faster than concat. However, the differences remain rather small on this sample data set.
na.omit does a ton of unnecessary checks and operations. Replacing it with a simple is.na call will give you a decent speedup:
concat3 = function(x) {
  x = x[!is.na(x)]
  if (length(x) == 0)
    NA_character_
  else
    paste(x, collapse = ",")
}
Using docendo's data (but with strings instead of factors - factors slow all versions down):
microbenchmark(dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")],
               dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")],
               times = 5)
#Unit: milliseconds
#                                                        expr       min        lq     mean   median       uq      max neval
# dt[, lapply(.SD, concat3), by = proid, .SDcols = -c("X4")]  960.2475 1079.177 1251.545 1342.684 1402.571 1473.045     5
# dt[, lapply(.SD, concat2), by = proid, .SDcols = -c("X4")] 1718.8121 1892.696 2159.148 2171.772 2470.205 2542.253     5
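Since concat3 is a drop-in replacement for concat/concat2, the same speedup should carry over to the dplyr version from the other answer (a sketch, not benchmarked here):
# same helper plugged into the dplyr pipeline from the other answer
df %>% group_by(proid) %>% summarise_each(funs(concat3), -X4)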