I need to find the row-wise minimum of many (>60) relatively large data.frames (~250,000 x 3) (or I can equivalently work on an xts).
set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")
The data frame my.df looks like this:
> head(my.df)
A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3
I tried
require(data.table)
my.dt <- as.data.table(my.df)
my.dt[, row.min:=0] # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
for (i in 1:dim(my.dt)[1]) my.dt[i, row.min:= min(A, B, C)]
)
On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table.
Am I using data.table correctly? Is there a more efficient way to do simple row-wise operations?
Or, just pmin.
my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user system elapsed
# 0.02 0.00 0.01
head(my.dt)
# A B C row.min
# [1,] 2 5 2 2
# [2,] 4 5 5 4
# [3,] 1 5 3 1
# [4,] 4 4 3 3
# [5,] 3 5 5 3
# [6,] 1 5 3 1
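If there are many columns, or the column names are not known in advance, the same idea can be written without spelling the names out. A minimal sketch using do.call over .SD (the .SDcols selection here is an assumption; adjust it to the columns you actually want):
library(data.table)
my.dt <- as.data.table(my.df)
# pmin applied across every column named in .SDcols, one value per row
my.dt[, row.min := do.call(pmin, .SD), .SDcols = c("A", "B", "C")]
head(my.dt)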
The classical way of doing row-wise operations in R is to use apply:
my.df$min <- apply(my.df, 1, min)
> head(my.df)
A B C min
1 2 5 4 2
2 4 3 1 1
3 1 1 5 1
4 4 1 5 1
5 3 3 4 3
6 1 1 1 1
On my machine, this operation takes about 0.25 of a second.
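For a rough sense of how this compares with the vectorised approach on the same data (timings are illustrative and will vary by machine), something like:
system.time(apply(my.df[, c("A", "B", "C")], 1, min))   # row-wise apply
system.time(pmin(my.df$A, my.df$B, my.df$C))            # vectorised pmin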
After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark:
pmin (Matt Dowle's answer above), below as tm1
apply (Andrie's answer above), below as tm2
melting and then taking the min by group, below as tm3
like so:
library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1 = numeric(), tm2 = numeric(), tm3 = numeric())
for (m in c(2.5,100)*1e5){
for (n in c(3,50)){
my.df <- sample(1:5, m*n, replace=TRUE)
dim(my.df) <- c(m,n)
my.df <- as.data.frame(my.df)
names(my.df) <- c(LETTERS,letters)[1:n]
my.dt <- as.data.table(my.df)
tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
my.dt <- as.data.table(my.df)
tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
my.dt <- as.data.table(my.df)
tm3 <- mean(microbenchmark(
melt(my.dt[, id:=1:nrow(my.dt)], id.vars='id')[, min(value), by=id],
times=30L
)$time)/1e6
b <- rbind(b, data.table(m, n, tm1, tm2, tm3) )
}
}
Running this (I ran out of time to try more combinations) gives us:
b
# m n tm1 tm2 tm3
# 1: 2.5e+05 3 16.20598 1000.345 39.36171
# 2: 2.5e+05 50 166.60470 1452.239 588.49519
# 3: 1.0e+07 3 662.60692 31122.386 1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))
library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
ylab('Runtime (millisec)') + xlab('# of rows') +
guides(linetype=guide_legend(title='Number of columns'))
Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well, given that R is not really designed for row-wise operations. I couldn't identify a case where pmin shouldn't be used over melt-min-by-group (tm3).
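For completeness, the melt-then-aggregate variant (tm3) can also be run outside the benchmarking harness. A minimal sketch on the original example, assuming my.df holds just the A, B, C columns from the question:
library(data.table)
my.dt <- as.data.table(my.df)
# reshape to long format (one row per id/column pair), then take the min within each id
long <- melt(my.dt[, id := 1:nrow(my.dt)], id.vars = "id")
row.mins <- long[, min(value), by = id]
head(row.mins)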