Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Row maximum in data table

Tags:

r

data.table

I have a dataset of 8,000,000 rows with 100 columns in a data.table where each column is a count. I need to find the maximum count in each row and which column this maximum is in.

I can quickly get which column has the maximum value for each row using

dt <- dt[, maxCol := which.max(.SD), by=pmxid]

but trying to get the actual maximum value using

dt <- dt[, nmax := max(.SD), by=pmxid]

is incredibly slow. I ran it for nearly 20 mins and only 200,000 row maximums had been calculated. Finding the max column took approx. 2 mins for all 8,000,000 rows.

How come finding the maximum takes so long? Shouldn't it take the same time as which.max() or less?

like image 766
mattdevlin Avatar asked Feb 12 '15 19:02

mattdevlin


People also ask

How many rows can DataTable handle?

The maximum number of rows that a DataTable can store is 16,777,216.

How much data can a DataTable hold?

According to MSDN DataTable Class "The maximum number of rows that a DataTable can store is 16,777,216"

How do you calculate the max of a row in R?

If we want to find the maximum of values two or more columns for each row in an R data frame then pmax function can be used.


1 Answers

Though, you are seeking a data.table solution, here is a base R solution which would be fast enough for your dataset.

indx <- max.col(df, ties.method='first')
df[cbind(1:nrow(df), indx)]

On a slightly bigger dataset, system.time comparisons revealed

system.time({
 indx <- max.col(df1, ties.method='first')
 res <- df1[cbind(1:nrow(df1), indx)]
})
#   user  system elapsed 
# 2.180   0.163   2.345 



df1$pmxid <- 1:nrow(df1)
dt <- as.data.table(df1)
system.time(dt[, nmax:= max(.SD), by= pmxid])
#      user   system  elapsed 
#1265.792    2.305 1267.836 

base R method to be faster than the data.table method in the post.

data

set.seed(24)
df <- as.data.frame(matrix(sample(c(NA,0:20), 20*10, 
       replace=TRUE), ncol=10))
#if there are NAs, change it to lowest number
df[is.na(df)] <- -999

set.seed(585)
df1 <- as.data.frame(matrix(sample(c(NA,0:20), 100*1e6,
 replace=TRUE), ncol=100))
df1[is.na(df1)] <- -999
like image 117
akrun Avatar answered Oct 22 '22 13:10

akrun