I need to find the row-wise minimum of many (60+) relatively large data.frames (~250,000 x 3 each), or I can equivalently work on an xts.
set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")
The data frame my.df looks like this:
> head(my.df)
A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3
I tried
require(data.table)
my.dt <- as.data.table(my.df)
my.dt[, row.min:=0] # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
  for (i in 1:dim(my.dt)[1]) my.dt[i, row.min := min(A, B, C)]
)
On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table.
Am I using data.table correctly? Is there a more efficient way to do simple row-wise operations?
Or, just pmin.
my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user system elapsed
# 0.02 0.00 0.01
head(my.dt)
# A B C row.min
# [1,] 2 5 2 2
# [2,] 4 5 5 4
# [3,] 1 5 3 1
# [4,] 4 4 3 3
# [5,] 3 5 5 3
# [6,] 1 5 3 1
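The same idea extends to wider tables and to the 60+ data frames mentioned in the question. A minimal sketch, assuming the tables are collected in a list (dt.list is a hypothetical name; the do.call(pmin, .SD) form is the same one used in the benchmark further down):
library(data.table)
# same result as pmin(A, B, C), but without naming every column explicitly
my.dt[, row.min := do.call(pmin, .SD), .SDcols = c("A", "B", "C")]
# hypothetical: dt.list stands in for the 60+ data.tables from the question;
# := updates each table by reference, so no copies are made
dt.list <- list(as.data.table(my.df), as.data.table(my.df))
invisible(lapply(dt.list, function(d) d[, row.min := do.call(pmin, .SD)]))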
The classical way of doing row-wise operations in R is to use apply:
apply(my.df, 1, min)
Storing the result as a new column and inspecting it:
> my.df$min <- apply(my.df, 1, min)
> head(my.df)
A B C min
1 2 5 4 2
2 4 3 1 1
3 1 1 5 1
4 4 1 5 1
5 3 3 4 3
6 1 1 1 1
On my machine, this operation takes about 0.25 of a second.
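The same pattern can stay inside data.table syntax by applying over .SD; a minimal sketch (note that apply still coerces the selected columns to a matrix first, which is why it falls behind pmin in the benchmark below):
my.dt <- as.data.table(my.df)
my.dt[, row.min := apply(.SD, 1, min), .SDcols = c("A", "B", "C")]  # row-wise min via apply
head(my.dt)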
After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark three approaches:
pmin (Matt Dowle's answer above), below as tm1
apply (Andrie's answer above), below as tm2
melting first and then taking min by group, below as tm3
so:
library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1 = numeric(), tm2 = numeric(), tm3 = numeric())
for (m in c(2.5,100)*1e5){
  for (n in c(3,50)){
    my.df <- sample(1:5, m*n, replace=TRUE)
    dim(my.df) <- c(m,n)
    my.df <- as.data.frame(my.df)
    names(my.df) <- c(LETTERS,letters)[1:n]
    my.dt <- as.data.table(my.df)
    tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm3 <- mean(microbenchmark(
      melt(my.dt[, id:=1:nrow(my.dt)], id.vars='id')[, min(value), by=id],
      times=30L
    )$time)/1e6
    b <- rbind(b, data.table(m, n, tm1, tm2, tm3))
  }
}
Running this (I ran out of time to try more combinations) gives us:
b
# m n tm1 tm2 tm3
# 1: 2.5e+05 3 16.20598 1000.345 39.36171
# 2: 2.5e+05 50 166.60470 1452.239 588.49519
# 3: 1.0e+07 3 662.60692 31122.386 1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))
library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
  ylab('Runtime (millisec)') + xlab('# of rows') +
  guides(linetype=guide_legend(title='Number of columns'))
[plot: Runtime (millisec) vs. # of rows, one line per method (tm1, tm2, tm3), linetype by number of columns]
Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well given that R is not really designed for row-wise operations. I couldn't identify a case where pmin shouldn't be preferred over melt-min-by-group (tm3).
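For completeness, a standalone sketch of the melt-min-by-group (tm3) approach on the original 250,000 x 3 table (mins is just a name I picked for the intermediate result; keyby=id keeps the per-row minima in row order so they can be written back as a column):
my.dt <- as.data.table(my.df)
my.dt[, id := .I]  # row identifier to group by after melting
mins <- melt(my.dt, id.vars = "id", measure.vars = c("A", "B", "C"))[, min(value), keyby = id]
my.dt[, row.min := mins$V1]  # write the per-row minima back as a column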