Suppose I have a data.table
with a few columns:
a <- data.table(id=1:1000, x=runif(100), y=runif(100), z=runif(100))
I want to drop the rows where x
, y
or z
is below the median:
a <- a[ x > median(x) & y > median(y) & z > median(z) ]
(aside: does the above call median
3 times or 3000 times?)
What I do is
my.cols <- c("x","y","z")
my.meds <- sapply(my.cols, function(n) median(a[[n]]))
a <- a[ Reduce(`&`,Map(function(i) a[[my.cols[i]]] > my.meds[i], 1:length(my.cols))) ]
Is this the best I could do?
When you find yourself iterating through columns, often long format can be easier. So just to show that option in this case, although it's not great.
The following is not fast because it uses [
on .SD
. But here's the logic anyway, for when we get round to optimizing .SD[...]
.
> a
id x y z
1: 60 0.006884017 0.9159115 0.876148
2: 160 0.006884017 0.9159115 0.876148
3: 260 0.006884017 0.9159115 0.876148
4: 360 0.006884017 0.9159115 0.876148
5: 460 0.006884017 0.9159115 0.876148
---
996: 504 0.990417986 0.7167666 0.751991
997: 604 0.990417986 0.7167666 0.751991
998: 704 0.990417986 0.7167666 0.751991
999: 804 0.990417986 0.7167666 0.751991
1000: 904 0.990417986 0.7167666 0.751991
> require(reshape2) # but data.table v1.8.11 has a fast melt built-in
> DT = as.data.table(melt(a, "id")) # copies here => bad
> DT
id variable value
1: 60 x 0.006884017
2: 160 x 0.006884017
3: 260 x 0.006884017
4: 360 x 0.006884017
5: 460 x 0.006884017
---
2996: 504 z 0.751991033
2997: 604 z 0.751991033
2998: 704 z 0.751991033
2999: 804 z 0.751991033
3000: 904 z 0.751991033
Now the data is in long format (can it be in long format in the first place?), the following step is easier :
> DT[, below:=value<median(value), by=variable]
> DT
id variable value below
1: 60 x 0.006884017 TRUE
2: 160 x 0.006884017 TRUE
3: 260 x 0.006884017 TRUE
4: 360 x 0.006884017 TRUE
5: 460 x 0.006884017 TRUE
---
2996: 504 z 0.751991033 FALSE
2997: 604 z 0.751991033 FALSE
2998: 704 z 0.751991033 FALSE
2999: 804 z 0.751991033 FALSE
3000: 904 z 0.751991033 FALSE
> DT[below==TRUE, .SD[.N==3], by=id]
id variable value below
1: 88 x 0.01873885 TRUE
2: 88 y 0.05834677 TRUE
3: 88 z 0.08973225 TRUE
4: 188 x 0.01873885 TRUE
5: 188 y 0.05834677 TRUE
---
356: 848 y 0.39433186 TRUE
357: 848 z 0.14152092 TRUE
358: 948 x 0.48932049 TRUE
359: 948 y 0.39433186 TRUE
360: 948 z 0.14152092 TRUE
Then dcast
that back if it needs to be wide. But I try and keep things long, like a database.
There's probably a more direct way to do the above, and maybe a way to avoid the .SD[...]
for speed, too.
Aside : I thought about setkey(a,x)
then just taking the first half. That works for one column. But then the second needs y<median(y)
where median(y)
is across all of a
, so you can't just setkey on the first half by y and then half again, and half again for z, because of that. But if something like that was possible, that would be very specialized to median
which was just an illustrative example in the question I assume.
One option is to construct the string you want and eval/parse
it:
EVAL = function(...)eval(parse(text=paste0(...))) # standard helper function
a[ EVAL(my.cols, ">median(", my.cols, ")", collapse=" & ") ]
My preferred way is to prepare the statement as I would be running it directly, so:
library(data.table)
a = data.table(id=1:1000, x=runif(100), y=runif(100), z=runif(100))
upper = c("x","y","z")
l = lapply(upper, function(col) call(">", as.name(col), call("median", as.name(col))))
ii = Reduce(function(c1, c2) substitute(.c1 & .c2, list(.c1=c1, .c2=c2)), l)
ii
#x > median(x) & y > median(y) & z > median(z)
a[eval(ii)]
# id x y z
# 1: 7 0.4750376 0.8936338 0.6158251
# ...
You basically use eval
in i
argument on your prepared statement.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With