I have a dataframe and a predictive model that I want to apply to the data. However, I want to filter out records for which the model might not apply very well. To do this, I have another dataframe that contains for every variable the minimum and maximum observed in the training data. I want to remove those records from my new data for which one or more values fall outside the specified range.
To make my question clear, this is what my data might look like:
id    x     y
----  ----  ---------
1     2     30521
2     -1    1835
3     5     25939
4     4     1000000
This is what my second table, with the mins and maxes, could look like:
var    min    max
-----  -----  -------
x      1      5
y      0      99999
In this example, I would want to flag the following records in my data: 2 (lower than the minimum for x) and 4 (higher than the max for y).
How could I easily do this in R? I have a hunch there's some clever dplyr code that would accomplish this task, but I wouldn't know what it would look like.
Suppose your data and limits look like this:
df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))
You can use the sweep function with the operators '>' and '<'. It's quite straightforward!
res = sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
res
####          x     y
#### [1,]  TRUE  TRUE
#### [2,] FALSE  TRUE
#### [3,] FALSE  TRUE
#### [4,]  TRUE FALSE
#### [5,] FALSE FALSE
#### [6,] FALSE  TRUE
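The TRUE entries tell you which observations to keep for each variable. This works for any number of variables, as long as the columns of df are in the same order as the rows of limits. Note that '>' and '<' are strict, so a value exactly equal to the minimum or maximum (such as x = 5 in row 3) is marked FALSE; use '>=' and '<=' if the limits should be inclusive.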
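The sweep call matches limits to columns by position. If that order cannot be guaranteed, one option is to align them by name first. This is only a minimal sketch: the helper name in_range is made up for this example, it assumes every name in limits$var is a column of the data, and it keeps the same strict comparisons as the code above.

```r
# Same example data as above
df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))

# Hypothetical helper: check each column named in limits$var against its own bounds
in_range = function(data, limits) {
  vars = as.character(limits$var)
  # Stop early if a variable listed in limits is missing from the data
  stopifnot(all(vars %in% names(data)))
  # Select the columns by name, in the same order as the rows of limits
  cols = data[, vars, drop=FALSE]
  sweep(cols, 2, limits$min, FUN='>') & sweep(cols, 2, limits$max, FUN='<')
}

# The column order of the input no longer matters
res_by_name = in_range(df[, c("y", "x")], limits)
```

The result has one column per row of limits, in that order, so any extra columns in the data (an id column, for instance) are simply ignored.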
After that, if you need a single flag per record (FALSE when at least one column is out of range), you can run this line, where res is the previous output:
apply(res, 1, all)
#### [1] TRUE FALSE FALSE FALSE FALSE FALSE
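To go from the flag to the filtered data, the logical vector can be used to subset the rows. The sketch below assumes the limits are meant to be inclusive, as in the question's example where x = 5 is not flagged, so it swaps in '>=' and '<='. The names res_incl, keep, df_ok and flagged are only illustrative.

```r
# Same example data as above
df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))

# Assumption: inclusive bounds, so a value equal to the min or max is in range
res_incl = sweep(df, 2, limits[, 2], FUN='>=') & sweep(df, 2, limits[, 3], FUN='<=')

# TRUE where every variable of the record is within its limits
keep = apply(res_incl, 1, all)

df_ok = df[keep, ]       # records the model can be applied to
flagged = which(!keep)   # row numbers of the records to flag
```

With the example data, this keeps rows 1 and 3 and flags rows 2, 4, 5 and 6.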