Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter data.table by multiple columns, dynamically

Tags:

r

data.table

Suppose I have a data.table with a few columns:

a <- data.table(id=1:1000, x=runif(100), y=runif(100), z=runif(100))

I want to drop the rows where x, y or z is below the median:

a <- a[ x > median(x) & y > median(y) & z > median(z) ]

(aside: does the above call median 3 times or 3000 times?)

What I do is

my.cols <- c("x","y","z")
my.meds <- sapply(my.cols, function(n) median(a[[n]]))
a <- a[ Reduce(`&`,Map(function(i) a[[my.cols[i]]] > my.meds[i], 1:length(my.cols))) ]

Is this the best I could do?

like image 234
sds Avatar asked Jan 09 '14 17:01

sds


3 Answers

When you find yourself iterating through columns, often long format can be easier. So just to show that option in this case, although it's not great.

The following is not fast because it uses [ on .SD. But here's the logic anyway, for when we get round to optimizing .SD[...].

> a
       id           x         y        z
   1:  60 0.006884017 0.9159115 0.876148
   2: 160 0.006884017 0.9159115 0.876148
   3: 260 0.006884017 0.9159115 0.876148
   4: 360 0.006884017 0.9159115 0.876148
   5: 460 0.006884017 0.9159115 0.876148
  ---                                   
 996: 504 0.990417986 0.7167666 0.751991
 997: 604 0.990417986 0.7167666 0.751991
 998: 704 0.990417986 0.7167666 0.751991
 999: 804 0.990417986 0.7167666 0.751991
1000: 904 0.990417986 0.7167666 0.751991

> require(reshape2)    # but data.table v1.8.11 has a fast melt built-in
> DT = as.data.table(melt(a, "id"))    # copies here => bad
> DT
       id variable       value
   1:  60        x 0.006884017
   2: 160        x 0.006884017
   3: 260        x 0.006884017
   4: 360        x 0.006884017
   5: 460        x 0.006884017
  ---                         
2996: 504        z 0.751991033
2997: 604        z 0.751991033
2998: 704        z 0.751991033
2999: 804        z 0.751991033
3000: 904        z 0.751991033

Now the data is in long format (can it be in long format in the first place?), the following step is easier :

> DT[, below:=value<median(value), by=variable]
> DT
       id variable       value below
   1:  60        x 0.006884017  TRUE
   2: 160        x 0.006884017  TRUE
   3: 260        x 0.006884017  TRUE
   4: 360        x 0.006884017  TRUE
   5: 460        x 0.006884017  TRUE
  ---                               
2996: 504        z 0.751991033 FALSE
2997: 604        z 0.751991033 FALSE
2998: 704        z 0.751991033 FALSE
2999: 804        z 0.751991033 FALSE
3000: 904        z 0.751991033 FALSE

> DT[below==TRUE, .SD[.N==3], by=id]
      id variable      value below
  1:  88        x 0.01873885  TRUE
  2:  88        y 0.05834677  TRUE
  3:  88        z 0.08973225  TRUE
  4: 188        x 0.01873885  TRUE
  5: 188        y 0.05834677  TRUE
 ---                              
356: 848        y 0.39433186  TRUE
357: 848        z 0.14152092  TRUE
358: 948        x 0.48932049  TRUE
359: 948        y 0.39433186  TRUE
360: 948        z 0.14152092  TRUE

Then dcast that back if it needs to be wide. But I try and keep things long, like a database.

There's probably a more direct way to do the above, and maybe a way to avoid the .SD[...] for speed, too.

Aside : I thought about setkey(a,x) then just taking the first half. That works for one column. But then the second needs y<median(y) where median(y) is across all of a, so you can't just setkey on the first half by y and then half again, and half again for z, because of that. But if something like that was possible, that would be very specialized to median which was just an illustrative example in the question I assume.

like image 119
Matt Dowle Avatar answered Nov 17 '22 03:11

Matt Dowle


One option is to construct the string you want and eval/parse it:

EVAL = function(...)eval(parse(text=paste0(...)))   # standard helper function

a[ EVAL(my.cols, ">median(", my.cols, ")", collapse=" & ") ]
like image 33
eddi Avatar answered Nov 17 '22 02:11

eddi


My preferred way is to prepare the statement as I would be running it directly, so:

library(data.table)
a = data.table(id=1:1000, x=runif(100), y=runif(100), z=runif(100))
upper = c("x","y","z")
l = lapply(upper, function(col) call(">", as.name(col), call("median", as.name(col))))
ii = Reduce(function(c1, c2) substitute(.c1 & .c2, list(.c1=c1, .c2=c2)), l)
ii
#x > median(x) & y > median(y) & z > median(z)
a[eval(ii)]
#       id         x         y         z
#  1:    7 0.4750376 0.8936338 0.6158251
# ...

You basically use eval in i argument on your prepared statement.

like image 1
jangorecki Avatar answered Nov 17 '22 02:11

jangorecki