Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R search in multiple datatable columns

I need to subset a data.table based on a condition across multiple columns and then perform an operation on the result.

A minimal example with a simple data.table:

x <- data.table(id=c(1, 2, 3, 4), colour1 = c('red', 'green', 'green', 'blue'), 
                colour2 = c('yellow', 'red', 'blue', 'black'), 
                colour3 = c('blue', 'black', 'red', 'yellow'),
                score = c(0.7, 0.9, 0.2, 0.35))

I then want to find the maximum score for any row that contains the colour 'yellow':

max_score <- max(x[colour1 == 'yellow' | colour2 == 'yellow' | colour3 == 'yellow']$score)

In this example, the above code works fine but is there a better way of checking this condition across multiple columns? In practice, the table will be much larger and the number of columns can change so I'd like to do this dynamically.

like image 475
Danny Friar Avatar asked Dec 14 '22 05:12

Danny Friar


2 Answers

It may feel difficult because your data is in a wide (i.e. "untidy") format. Just convert to long format using melt and it's really simple.

melt(x, measure.vars = grep("^colo", names(x)))[,max(score[value == "yellow"])]
# [1] 0.7

or

melt(x, measure.vars = grep("^colo", names(x)))[value == "yellow", max(score)]

To see what's going on, run this first

melt(x, measure.vars = grep("^colo", names(x)))

and then the whole command.


Of course you can do it in two steps too, in case you want to keep the long format for further calculations:

x_tidy <- melt(x, measure.vars = grep("^colo", names(x)))
x_tidy[value == "yellow", max(score)]
#[1] 0.7
like image 76
talat Avatar answered Dec 22 '22 00:12

talat


Using rowSums:

max(
  x[ rowSums(x[, grepl("colour", colnames(x)), with = FALSE] == "yellow") > 0,
     "score", with = FALSE]
  )
like image 26
zx8754 Avatar answered Dec 22 '22 00:12

zx8754