Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter data frame based on multiple conditions in another data frame

Tags:

r

subset

I have two data frames:

  • df where for single x there are multiple values y
  • matchDf which defines constraints I would like to use to filter df

I would like to filter only rows where y is lower or equal to max_y given that x falls in interval <x_from, x_to)

Is there any way to do this in R (without SQL I used below)?

df <- read.table(header = TRUE, text = '
x y
100 0.1
100 0.2
100 0.3
250 0.2
250 0.3
250 0.4
375 0.2
375 0.25
375 0.35
420 0.15
420 0.16
420 0.17
500 0.23
500 0.55')

matchDf <- read.table(header = TRUE, text = '
x_from x_to max_y
0 300 .2
300 500 .3
500 99999 .5
')

library(sqldf)

sqldf('select a.* 
      from 
        df a 
        join matchDf b on (a.x >= b.x_from 
                           and a.x < b.x_to 
                           and a.y <= b.max_y)'
      )
like image 985
Tomas Greif Avatar asked Mar 16 '23 17:03

Tomas Greif


1 Answers

Try this:

df[df$y <= matchDf$max_y[cut(df$x, c(0,matchDf$x_to))],]
     x    y
1  100 0.10
2  100 0.20
4  250 0.20
7  375 0.20
8  375 0.25
10 420 0.15
11 420 0.16
12 420 0.17
13 500 0.23

What's happening here is that cut is giving you which row in matchDf each observation in df should be in. Then, you just use that as a positional row extraction vector, and state the conditional relationship for y using <=.

To see how cut works, just pull it out of the expression:

> cut(df$x, c(0,matchDf$x_to))
 [1] (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500]
Levels: (0,300] (300,500] (500,1e+05]

The level labels are irrelevant, because [ uses the underlying integer values to extract.

like image 56
Thomas Avatar answered Mar 19 '23 05:03

Thomas