I need to subset a data.table based on a condition across multiple columns and then perform an operation on the result.
A minimal example with a simple data.table:
x <- data.table(id=c(1, 2, 3, 4), colour1 = c('red', 'green', 'green', 'blue'),
colour2 = c('yellow', 'red', 'blue', 'black'),
colour3 = c('blue', 'black', 'red', 'yellow'),
score = c(0.7, 0.9, 0.2, 0.35))
I then want to find the maximum score for any row that contains the colour 'yellow':
max_score <- max(x[colour1 == 'yellow' | colour2 == 'yellow' | colour3 == 'yellow']$score)
In this example, the above code works fine but is there a better way of checking this condition across multiple columns? In practice, the table will be much larger and the number of columns can change so I'd like to do this dynamically.
It may feel difficult because your data is in a wide (i.e. "untidy") format. Just convert to long format using melt
and it's really simple.
melt(x, measure.vars = grep("^colo", names(x)))[,max(score[value == "yellow"])]
# [1] 0.7
or
melt(x, measure.vars = grep("^colo", names(x)))[value == "yellow", max(score)]
To see what's going on, run this first
melt(x, measure.vars = grep("^colo", names(x)))
and then the whole command.
Of course you can do it in two steps too, in case you want to keep the long format for further calculations:
x_tidy <- melt(x, measure.vars = grep("^colo", names(x)))
x_tidy[value == "yellow", max(score)]
#[1] 0.7
Using rowSums:
max(
x[ rowSums(x[, grepl("colour", colnames(x)), with = FALSE] == "yellow") > 0,
"score", with = FALSE]
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With