Apply a function to each column with a condition in data.table [R]

Tags: r, data.table

I'd like to apply a couple of functions to a column, but only under a condition: in this case, only for rows where another column is not NA. To illustrate, I'll add some NAs to the iris dataset and turn it into a data.table:

library(data.table)

irisdt <- iris
## Prep some example data
irisdt[irisdt$Sepal.Length < 5,]$Sepal.Length <- NA
irisdt[irisdt$Sepal.Width < 3,]$Sepal.Width <- NA

## Turn this into a data.table
irisdt <- as.data.table(irisdt)

If I wanted to apply max to multiple columns, I'd do it like this:

## Apply a function to individual columns
irisdt[, lapply(.SD, max), .SDcols = c("Petal.Length", "Petal.Width")]
#>    Petal.Length Petal.Width
#> 1:          6.9         2.5

In this case, however, I'd like to keep only the rows where Sepal.Length is not NA, then return the max and min of Petal.Length along with the name of the column I filtered on, and do the same for Sepal.Width. Below is an ugly way of implementing this, but it hopefully illustrates what I am after:

## Here is what the table would look like
desired_table <- rbind(
  irisdt[!is.na(Sepal.Length), .(max = max(Petal.Length), min = min(Petal.Length), var = "Sepal.Length")],
  irisdt[!is.na(Sepal.Width), .(max = max(Petal.Length), min = min(Petal.Length), var = "Sepal.Width")]
)

desired_table
#>    max min          var
#> 1: 6.9 1.2 Sepal.Length
#> 2: 6.7 1.0  Sepal.Width

Created on 2020-01-14 by the reprex package (v0.3.0)

Any thoughts on how I might accomplish this?

boshek asked Jan 14 '20 23:01




1 Answer

melt may be a better option if we are filtering on multiple columns. Reshape into 'long' format, use i with the condition !is.na(value), group by 'variable', and get the min and max of the column of interest (here Petal.Length):

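library(data.table)
## Reshape to long format: one row per (original row, Sepal column) pair
long <- melt(irisdt, measure.vars = c('Sepal.Length', 'Sepal.Width'))
## Keep non-NA rows, then summarise Petal.Length within each Sepal column
long[!is.na(value),
     .(max = max(Petal.Length), min = min(Petal.Length)), by = variable]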
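
For comparison, here is a minimal sketch of a loop-based alternative that generalises the rbind() from the question without reshaping. It is only one possible approach, not something from the answer itself. The vector name na_cols is made up for illustration, and the data prep assumes the NAs are introduced the same way as in the question.

```r
library(data.table)

# Rebuild the question's example data: NAs in the two Sepal columns
irisdt <- as.data.table(iris)
irisdt[Sepal.Length < 5, Sepal.Length := NA_real_]
irisdt[Sepal.Width < 3, Sepal.Width := NA_real_]

# Hypothetical vector naming the columns whose NAs define each subset
na_cols <- c("Sepal.Length", "Sepal.Width")

# Build one summary row per column, then stack the rows with rbindlist()
res <- rbindlist(lapply(na_cols, function(col) {
  irisdt[!is.na(get(col)),
         .(max = max(Petal.Length), min = min(Petal.Length), var = col)]
}))

res
#>    max min          var
#> 1: 6.9 1.2 Sepal.Length
#> 2: 6.7 1.0  Sepal.Width
```

This scans the table once per column in na_cols, whereas melt doubles the number of rows up front; for a handful of columns either should be fine.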

If we are doing this for multiple variables, then use the lapply(.SD, ...
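
As a sketch of how lapply(.SD, ...) could fit in here, assuming the goal is to summarise several columns (not just Petal.Length) within each non-NA subset: the names summary_cols, long and res, and the max_/min_ prefixes, are assumptions made for illustration rather than part of the answer.

```r
library(data.table)

# Rebuild the question's example data: NAs in the two Sepal columns
irisdt <- as.data.table(iris)
irisdt[Sepal.Length < 5, Sepal.Length := NA_real_]
irisdt[Sepal.Width < 3, Sepal.Width := NA_real_]

# Long format: the Sepal columns become 'variable' / 'value' pairs
long <- melt(irisdt, measure.vars = c("Sepal.Length", "Sepal.Width"))

# Hypothetical choice of columns to summarise
summary_cols <- c("Petal.Length", "Petal.Width")

# Apply max and min to every column in .SD, prefixing the result names
# so the two sets of columns do not collide
res <- long[!is.na(value),
            c(setNames(lapply(.SD, max), paste0("max_", summary_cols)),
              setNames(lapply(.SD, min), paste0("min_", summary_cols))),
            by = variable, .SDcols = summary_cols]

res
```

The result has one row per filtered column and one max_/min_ pair per entry in summary_cols.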

akrun answered Nov 14 '22 20:11