data.table and pmin with na.rm=TRUE argument

Tags: r, data.table

I am trying to calculate the row-wise minimum with the pmin function in data.table (similar to the post "row-by-row operations and updates in data.table"), but over a character vector of column names, using something like the with=FALSE syntax, and with the na.rm=TRUE argument.

DT <- data.table(x = c(1,1,2,3,4,1,9), 
                 y = c(2,4,1,2,5,6,6),
                 z = c(3,5,1,7,4,5,3),
                 a = c(1,3,NA,3,5,NA,2))

> DT
   x y z  a
1: 1 2 3  1
2: 1 4 5  3
3: 2 1 1 NA
4: 3 2 7  3
5: 4 5 4  5
6: 1 6 5 NA
7: 9 6 3  2

I can calculate the minimum across rows using columns directly:

DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]

giving

> DT
   x y z  a min_val
1: 1 2 3  1       1
2: 1 4 5  3       1
3: 2 1 1 NA       1
4: 3 2 7  3       2
5: 4 5 4  5       4
6: 1 6 5 NA       1
7: 9 6 3  2       2

However, I need to do this over a large, automatically generated set of columns, so I want it to work for an arbitrary set of column names stored in a variable, e.g. col_names <- c("a","y","z")

I can do this:

DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]

But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as

DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]

but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.

Any ideas?

Allen Wang asked Mar 03 '16 17:03


1 Answer

To get the minimum of each row across the whole dataset, call pmin through do.call on .SD, and pass na.rm=TRUE by concatenating it, as a one-element list, onto .SD so that it becomes part of the argument list.

DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
#   x y z  a col_min
#1: 1 2 3  1       1
#2: 1 4 5  3       1
#3: 2 1 1 NA       1
#4: 3 2 7  3       2
#5: 4 5 4  5       4
#6: 1 6 5 NA       1
#7: 9 6 3  2       2
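
The reason this works is that do.call treats each element of the list as one argument to the function, and a named element becomes a named argument. The sketch below uses a plain named list (called cols here purely for illustration) in place of .SD, on the assumption that .SD behaves as a list of columns in this context, and checks that the do.call form matches the direct call from the question.

```r
# Plain named list standing in for .SD (same values as the question's DT)
cols <- list(x = c(1, 1, 2, 3, 4, 1, 9),
             y = c(2, 4, 1, 2, 5, 6, 6),
             z = c(3, 5, 1, 7, 4, 5, 3),
             a = c(1, 3, NA, 3, 5, NA, 2))

# Appending list(na.rm = TRUE) adds one more named element to the argument list
args <- c(cols, list(na.rm = TRUE))
names(args)                      # "x" "y" "z" "a" "na.rm"

# do.call spreads the list into pmin(x = ..., y = ..., z = ..., a = ..., na.rm = TRUE)
via_do_call <- do.call(pmin, args)
direct      <- pmin(cols$x, cols$y, cols$z, cols$a, na.rm = TRUE)

all.equal(via_do_call, direct)   # TRUE
```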

To do this for only the subset of columns whose names are stored in 'col_names', add the .SDcols argument.

DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))), 
                .SDcols= col_names]
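
As a runnable check, here is a sketch applying the .SDcols version to the question's data with col_names <- c("a","y","z"). It also shows an alternative that is not part of the original answer: a variadic wrapper, given the hypothetical name pmin_narm. The attempt in the question used function(x), which accepts a single argument, while do.call supplies one argument per column, so that call fails; a wrapper taking ... avoids this.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 4, 1, 9),
                 y = c(2, 4, 1, 2, 5, 6, 6),
                 z = c(3, 5, 1, 7, 4, 5, 3),
                 a = c(1, 3, NA, 3, 5, NA, 2))
col_names <- c("a", "y", "z")

# Variant 1 (from the answer): append na.rm to the argument list built from .SD
DT[, col_min := do.call(pmin, c(.SD, list(na.rm = TRUE))), .SDcols = col_names]

# Variant 2 (illustrative alternative): a variadic wrapper that fixes na.rm = TRUE.
# pmin_narm is a hypothetical helper name, not a data.table or base R function.
pmin_narm <- function(...) pmin(..., na.rm = TRUE)
DT[, col_min2 := do.call(pmin_narm, .SD), .SDcols = col_names]

DT$col_min    # 1 3 1 2 4 5 2
DT$col_min2   # 1 3 1 2 4 5 2
```

Rows 3 and 6 have NA in column a, so their minimum is taken over y and z only.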

akrun answered Oct 06 '22 10:10