I have a data.frame (link to file) with 18 columns and 11520 rows that I transform like this:
library(plyr)
df.median <- ddply(data, .(groupname, starttime, fPhase, fCycle),
                   numcolwise(median), na.rm = TRUE)
According to system.time(), it takes about this long to run:
user system elapsed
5.16 0.00 5.17
This call is part of a webapp, so run time is pretty important. Is there a way to speed this call up?
Beyond performance limitations due to design and implementation, it has to be said that a lot of R code is slow simply because it's poorly written. Few R users have any formal training in programming or software development.
The data.table function is fastest, followed by the tidyverse version and then the base R function. Comparing relative speeds, the base R function is almost 4 times slower and the dplyr function about 3 times slower than the data.table function!
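For the transformation in the question, roughly equivalent data.table and dplyr versions would look like the sketch below. Neither package appears in the original code, so treat the column selection as an assumption rather than a drop-in answer:

library(data.table)
library(dplyr)

group.cols <- c("groupname", "starttime", "fPhase", "fCycle")

# data.table: median of every numeric non-grouping column, per group
dt <- as.data.table(data)   # copy; setDT(data) would convert in place instead
num.cols <- setdiff(names(dt)[sapply(dt, is.numeric)], group.cols)
dt.median <- dt[, lapply(.SD, median, na.rm = TRUE),
                by = group.cols, .SDcols = num.cols]

# dplyr: the same aggregation, for comparison
df.median.dplyr <- data %>%
  group_by(groupname, starttime, fPhase, fCycle) %>%
  summarise(across(all_of(num.cols), function(x) median(x, na.rm = TRUE)),
            .groups = "drop")

Timing these with system.time() on the actual data would show whether the relative speeds quoted above hold for this particular aggregation.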
I just noticed that the same code runs much faster in R launched from the server's terminal than in RStudio Server, and the difference is quite significant. For example, I wrote a heavy R script that handles a large amount of data (>10GB).
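One way to check whether the slowdown comes from the front end rather than the code itself is to time the whole script in each environment. This is only a sketch, and heavy_script.R is a placeholder name rather than a file from the original post:

# In an R session (RStudio Server or terminal R), time the complete script.
# From the shell it can also be run non-interactively with: Rscript heavy_script.R
system.time(source("heavy_script.R"))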
Just using aggregate is quite a bit faster...
> groupVars <- c("groupname","starttime","fPhase","fCycle")
> dataVars <- colnames(data)[ !(colnames(data) %in% c("location",groupVars)) ]
>
> system.time(ag.median <- aggregate(data[,dataVars], data[,groupVars], median))
user system elapsed
1.89 0.00 1.89
> system.time(df.median <- ddply(data, .(groupname,starttime,fPhase,fCycle), numcolwise(median), na.rm=TRUE))
user system elapsed
5.06 0.00 5.06
>
> ag.median <- ag.median[ do.call(order, ag.median[,groupVars]), colnames(df.median)]
> rownames(ag.median) <- 1:NROW(ag.median)
>
> identical(ag.median, df.median)
[1] TRUE
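For a more robust comparison than one-off system.time() calls, a sketch using the microbenchmark package (not part of the original answer) could repeat each approach several times and report the spread of timings:

library(microbenchmark)
library(plyr)

# groupVars and dataVars as defined in the transcript above
microbenchmark(
  aggregate = aggregate(data[, dataVars], data[, groupVars], median),
  ddply     = ddply(data, .(groupname, starttime, fPhase, fCycle),
                    numcolwise(median), na.rm = TRUE),
  times = 10
)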