I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.
I can quickly get which column has the maximum value for each row using
dt <- dt[, maxCol := which.max(.SD), by=pmxid]
but trying to get the actual maximum value using
dt <- dt[, nmax := max(.SD), by=pmxid]
is incredibly slow. I ran it for nearly 20 minutes and only 200,000 row maxima had been calculated. Finding the max column took approximately 2 minutes for all 8,000,000 rows.
How come finding the maximum takes so long? Shouldn't it take the same time as which.max(), or less?
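A toy version of the setup (the counts are made up; pmxid is a unique row id):

library(data.table)
dt <- data.table(pmxid = 1:3, a = c(1L, 9L, 4L), b = c(7L, 2L, 4L))
dt[, maxCol := which.max(.SD), by = pmxid]  # fast: column index of the row max
dt[, nmax := max(.SD), by = pmxid]          # correct, but very slow at scale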
To find the maximum across two or more columns for each row of an R data frame, the pmax() function can be used.
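For instance, a minimal sketch (ex and its columns are made up for illustration):

ex <- data.frame(a = c(1, 5, 3), b = c(4, 2, 6))
pmax(ex$a, ex$b)   # 4 5 6, the row-wise maximum of the two columns
# with many columns, do.call() passes every column of the data frame to pmax()
do.call(pmax, ex)  # same result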
Though you are seeking a data.table solution, here is a base R solution which would be fast enough for your dataset.
# column index of each row's maximum; ties.method='first' matches which.max()
indx <- max.col(df, ties.method='first')
# matrix indexing then extracts the maximum value of every row in one shot
df[cbind(1:nrow(df), indx)]
On a slightly bigger dataset (df1, created below), system.time() comparisons revealed
system.time({
indx <- max.col(df1, ties.method='first')
res <- df1[cbind(1:nrow(df1), indx)]
})
# user system elapsed
# 2.180 0.163 2.345
df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax := max(.SD), by = pmxid])
#     user   system  elapsed
# 1265.792    2.305 1267.836
i.e., the base R method is over 500 times faster than the data.table method in the post. The likely culprit: by=pmxid creates one group per row, so max(.SD) is evaluated a million times, and max() on a data.table dispatches the relatively expensive Summary.data.frame method on every call, whereas which.max() is a cheap internal function. max.col() sidesteps per-row calls altogether by scanning the whole matrix at once.
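If you would rather stay within data.table, a vectorised sketch along the same lines (untested at this scale; cols is a hypothetical character vector naming the count columns, so pmxid is excluded):

library(data.table)
cols <- setdiff(names(dt), "pmxid")
# row-wise max without any grouping: pmax() scans all columns in parallel
dt[, nmax := do.call(pmax, .SD), .SDcols = cols]
# max.col() gives the position of that maximum, matching which.max() on ties
dt[, maxCol := max.col(.SD, ties.method = "first"), .SDcols = cols]

Both calls touch each column once instead of calling an R function per row.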
# data: a small example (df) and the bigger benchmark dataset (df1)
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 0:20), 20*10,
        replace=TRUE), ncol=10))
# if there are NAs, replace them with a number lower than any real count
df[is.na(df)] <- -999
set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA, 0:20), 100*1e6,
        replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
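As an aside, the NA-to--999 replacement is needed because max.col() has no na.rm argument; if you take the pmax() route instead, the NAs can be skipped directly (a sketch, run before the -999 replacement):

# pmax() accepts na.rm, so no sentinel value is required
do.call(pmax, c(df1, na.rm = TRUE))  # row maxima ignoring NAs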