Efficient data table/ frame/ tibble/ other aggregation in R

I have a data.table of values for which I compute summarized statistics in a loop, and seek to aggregate the summarized results for additional processing. However, the loop processing time is unexpectedly lengthy as a result of the aggregation, and I'm seeking a faster solution.

The method very much resembles the approach discussed here (Assign a vector to a specific existing row of data table in R).

Code (slightly abridged for readability, but the material components are preserved):

library(data.table);
x <- data.table(matrix(double(),nrow=10000,ncol=120));
system.time({for (i in NROW(x):1) {
    m <- matrix(rnorm(8*15),nrow=8,ncol=15);
}});
#  user  system elapsed 
# 0.165   0.006   0.171 
system.time({for (i in NROW(x):1) {
    m <- matrix(rnorm(8*15),nrow=8,ncol=15);
    as.list(t(m[1:8,]));
}});
#   user  system elapsed 
#  0.245   0.001   0.249
system.time({for (i in NROW(x):1) {
    m <- matrix(rnorm(8*15),nrow=8,ncol=15);
    x[i,] <- as.list(t(m[1:8,]));
}});
#   user  system elapsed 
# 36.227   0.682  37.529
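For comparison, a minimal sketch with the same dimensions: assigning into a preallocated plain numeric matrix, where row assignment happens in place, and converting to a data.table once at the end.

```r
library(data.table)

# Preallocate a plain numeric matrix of the same shape
xm <- matrix(double(), nrow = 10000, ncol = 120)

system.time({
  for (i in nrow(xm):1) {
    m <- matrix(rnorm(8 * 15), nrow = 8, ncol = 15)
    # Plain matrix row assignment modifies xm in place
    xm[i, ] <- as.vector(t(m))
  }
  # Single conversion after the loop
  x2 <- as.data.table(xm)
})
```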

# Obtain input data.table
inputdt <- fread('filename');

# Preallocate summary statistics aggregate
sumstatsdt <- data.table(matrix(double(),nrow=10000,ncol=120));

# Loop over input data.table (the *apply suite not suitable for mypkg::calcstats())
for (i in NROW(inputdt):1) {
    # Produce a matrix of summary statistics for the row (of type double)
    sumstat_matrix <- mypkg::calcstats(inputdt,...);

    # Aggregate the summary statistics (where "a","b","c",... are matrix row names of ordered statistics)
    # >>>> This is the operation that leads to lengthy execution time
    sumstatsdt[i,] <- as.list(t(sumstat_matrix[c("a","b","c",...),]));
};

The input data.table contains 10,000 observations with 8 attributes, and there are a total of 1.2 million summary statistics to be stored (each of type 'double'). When commenting out the last line in the loop that performs the aggregation, the total processing time is around 24 sec. When run with the aggregation, total processing time increases to 34 min.

I have tried comparable code with data.frame and cbind(), with loosely similar performance results (I have not yet had a chance to try the tidyverse suite). I recognize that deep-copy operations will be somewhat slower, though the magnitude of the difference, given the relatively small dataset, seems to indicate a different issue.

Running R v3.4.4 and data.table v1.11.4 on a recent Fedora install. Memory usage is negligible (less than 3% of system RAM in use during script execution). One 2.1 GHz CPU core with affinity to the R session runs at nearly 100% for the duration of the script; no other processes are associated with that core, and the remaining cores are largely idle. (NB: the illustrative code was run in a KVM guest on a different machine.)

Sidenote: I'm also curious why the bottleneck manifests as CPU load in what otherwise appears to be a memory issue.

Appreciate the time, and happy to provide additional info as useful.

Edit [2018.10.31]

  • Included the illustrative code requested by 42
asked Mar 10 '26 by Whee
1 Answer

Your machine seems about twice as fast as mine, but I still get an almost 15-fold improvement for sequential assignment to particular rows of a data.table by using the data.table operator := instead of <-. I suspect that <- goes through the usual R step of creating an intermediate temporary copy, so the := method is likely also more memory-efficient:

?`:=`

system.time({for (i in NROW(x):1) {
    m <- matrix(rnorm(8*15),nrow=8,ncol=15);
    x[i, (1:120) := as.list(t(m))] } })

#   user  system elapsed 
# 4.390   0.096   4.486 


system.time({for (i in NROW(x):1) {
    m <- matrix(rnorm(8*15),nrow=8,ncol=15);
    x[i, ] <- as.list(t(m)) } })

#   user  system elapsed 
# 67.963  15.573  83.572 
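An even lower-overhead variant, as a sketch under the same setup: data.table's set() is the loop-oriented counterpart to :=, updating columns by reference while skipping the [.data.table dispatch overhead on every iteration.

```r
library(data.table)

x <- data.table(matrix(double(), nrow = 10000, ncol = 120))

system.time({
  for (i in NROW(x):1) {
    m <- matrix(rnorm(8 * 15), nrow = 8, ncol = 15)
    # set(dt, i, j, value) assigns by reference with minimal per-call overhead
    set(x, i = i, j = 1L:120L, value = as.list(t(m)))
  }
})
```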

Regular R is a single-threaded process unless you install one of the "after-market" builds such as Microsoft R Open (distributed via MRAN).

IRTFM answered Mar 12 '26