
How to remove records from dataframe that fall outside variable-specific ranges? [R]

Tags: r, outliers

I have a dataframe and a predictive model that I want to apply to the data. However, I want to filter out records for which the model might not apply very well. To do this, I have another dataframe that contains for every variable the minimum and maximum observed in the training data. I want to remove those records from my new data for which one or more values fall outside the specified range.

To make my question clear, this is what my data might look like:

  id   x       y     
 ---- ---- --------- 
   1    2     30521  
   2   -1      1835  
   3    5     25939  
   4    4   1000000  

This is what my second table, with the mins and maxes, could look like:

  var   min    max   
 ----- ----- ------- 
  x       1       5  
  y       0   99999  

In this example, I would want to flag the following records in my data: 2 (lower than the minimum for x) and 4 (higher than the max for y).

How could I easily do this in R? I have a hunch there's some clever dplyr code that would accomplish this task, but I wouldn't know what it would look like.

A. Stam asked Nov 09 '22 07:11


1 Answer

Suppose your data look like this:

df <- data.frame(x = c(2, -1, 5, 4, 7, 8), y = c(30521, 1800, 25000, 1000000, -5, 10))
limits <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

You can use the sweep function with the operators '>' and '<' to compare each column against its own lower and upper bound:

sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
####          x     y
#### [1,]  TRUE  TRUE
#### [2,] FALSE  TRUE
#### [3,] FALSE  TRUE
#### [4,]  TRUE FALSE
#### [5,] FALSE FALSE
#### [6,] FALSE  TRUE

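A TRUE marks a value that lies within the range for its variable; a FALSE marks a value outside it. This works for any number of variables, as long as the rows of limits are in the same order as the columns of df, because sweep matches them by position. Note that the strict comparisons also exclude values exactly equal to a bound (row 3, x = 5); use '>=' and '<=' if the observed minimum and maximum should count as in range.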
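For the data in the question, where record 3 (x = 5) sits exactly on the maximum and should not be flagged, an inclusive variant might look like the sketch below. The names dat, vars, vals and in_range are made up for illustration, and the columns are selected by name from limits$var so that each column is compared with its own bounds.

```r
# Data and limits as given in the question
dat <- data.frame(id = 1:4,
                  x  = c(2, -1, 5, 4),
                  y  = c(30521, 1835, 25939, 1000000))
limits <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# Pick the columns in the order listed in limits, so column k of vals
# is compared with row k of limits (as.character guards against factors)
vars <- as.character(limits$var)
vals <- as.matrix(dat[, vars, drop = FALSE])

# Inclusive comparisons: a value equal to the min or max counts as in range
in_range <- sweep(vals, 2, limits$min, FUN = ">=") &
            sweep(vals, 2, limits$max, FUN = "<=")
in_range
```

Rows 2 and 4 each contain a FALSE, which matches the two records the question wants flagged.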

If you then need a single flag per row (TRUE only when every column is within its range), run this line on the previous output, stored as res:

apply(res, 1, all)
#### [1]  TRUE FALSE FALSE FALSE FALSE FALSE
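To actually remove the out-of-range records, the per-row flag can be used as a logical index. This is a minimal sketch on the example data from this answer, with the strict comparisons used above; keep, kept and dropped_rows are illustrative names, and as.matrix is applied first only to keep the comparison on a plain numeric matrix.

```r
df <- data.frame(x = c(2, -1, 5, 4, 7, 8),
                 y = c(30521, 1800, 25000, 1000000, -5, 10))
limits <- data.frame(var = c("x", "y"), min = c(1, 0), max = c(5, 99999))

# Logical matrix: TRUE where a value lies strictly inside its range
m   <- as.matrix(df)
res <- sweep(m, 2, limits$min, FUN = ">") & sweep(m, 2, limits$max, FUN = "<")

# One flag per row: TRUE only if every column is in range
keep <- apply(res, 1, all)

kept         <- df[keep, ]    # records the model can be applied to
dropped_rows <- which(!keep)  # row numbers of the records that were removed
```

With many rows, rowSums(!res) == 0 gives the same flag as apply(res, 1, all) and is usually faster.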
agenis answered Nov 15 '22 06:11