Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I filter a range of numbers in R? [duplicate]

Tags:

r

filtering

Let's say I have the data frame Mydata as shown below:

Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))

I want to filter this data frame and create another data frame, so that only the values of x between 3 and 7 and their corresponding y values are shown. I attempted the following:

new_frame <- Mydata %>% filter(x == (3:7))

This didn't work. How then would I filter for a specified range?

Thanks in advance for all help

like image 935
Paul Ibrahim Avatar asked Jun 29 '18 18:06

Paul Ibrahim


People also ask

How do I filter multiple numbers in R?

In this, first, pass your dataframe object to the filter function, then in the condition parameter write the column name in which you want to filter multiple values then put the %in% operator, and then pass a vector containing all the string values which you want in the result.


6 Answers

You can use %in%, or as has been mentioned, alternatively dplyrs between():

 library(dplyr)
 
 new_frame <- Mydata %>% filter(x %in% (3:7) )
 new_frame
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97

While %in% works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%, use dplyr's between():

 new_frame2 <- Mydata%>% filter( between(x, 3, 7) )
 new_frame2
 #   x  y
 # 1 3 45
 # 2 4 54
 # 3 5 65
 # 4 6 78
 # 5 7 97     

To further clarify, note that %in% checks for the presence in a set of values:

3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE

The above return TRUE because 3:7 is shorthand for seq(3, 7) which produces:

3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7

As such, if you were to use %in% to check for values not produced by :, it will return FALSE:

4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE

Whereas between checks against the end points and all values in between:

between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
like image 91
MHammer Avatar answered Oct 14 '22 04:10

MHammer


Plenty of good dplyr solutions such as filtering in or hard-coding the upper and lower bounds already present in some of the answers:

MydataTable%>% filter(between(x, 3, 70))
Mydata %>% filter(x %in% 3:7)
Mydata %>% filter(x>=3&x<=7)

You could also work with data.table, which is very fast for large data sets. inrange and between work identically for this purpose

library(data.table)
MydataTable <- data.table(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                          y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))
MydataTable[x %inrange% c(3,7)]
MydataTable[x %between% c(3,7)]

A benefit of this method (besides the speed of data.table) is that you only need to specify the min and max range - you are not creating an array to subset the filter.

A time comparison of these methods:

> df <- data.frame(x = sample(1:10, 10000000, replace = T),
+                      y = sample(1:10, 10000000, replace = T))
> system.time({ df %>% filter(between(x, 3, 7)) })
   user  system elapsed 
   0.18    0.05    0.14 
> system.time({ df %>% filter(x %in% 3:7) })
       user  system elapsed 
       0.19    0.06    0.29 
> system.time({ df %>% filter(x>=3&x<=7)  })
   user  system elapsed 
   0.17    0.09    0.26 

> dt <- data.table(df)
> system.time( {dt[x %inrange% c(3,7)] })
   user  system elapsed 
   0.13    0.07    0.21 
> system.time( {dt[x %between% c(3,7)] })
   user  system elapsed 
   0.18    0.05    0.13
like image 38
cacti5 Avatar answered Oct 14 '22 03:10

cacti5


And the good old base::subset:

subset(Mydata, x >= 3 & x <= 7)
subset(Mydata, x %in% 3:7)
like image 35
Moody_Mudskipper Avatar answered Oct 14 '22 04:10

Moody_Mudskipper


Piggybacking on @Anna 's answer, I just ran a few of the options to see which was faster on a larger dataset for a problem have at work. I used the setup from here (Faster way to subset on rows of a data frame in R?) I checked it on a 1 billion row (16gb) dataset. Looks like data.table edged out dplyr by a little bit. I am just starting to use data.table though so I may have not used the most efficient code. Oh, also, I narrowed it down to these 4 based on times from a 100million row dataset. See below:

set.seed(42)  
# 1 billion rows
df <- data.frame(age=sample(1:65,1e9,replace=TRUE),x=rnorm(1e9),y=rpois(1e9,25))



microbenchmark(df1 <- df %>% filter(age >= 5 & age <= 25),
               df2 <- df %>% filter(dplyr::between(df$age, 5, 25)),
               times=10)


Unit: seconds
                                      expr      min     lq   mean median     uq    max  neval
df %>% filter(age >= 5 & age <= 25)          15.327 15.796 16.526 16.601 17.086 17.996    10
df %>% filter(dplyr::between(df$age, 5, 25)) 14.214 14.752 15.413 15.487 16.121 16.447    10


DT <- as.data.table(df)
microbenchmark(dt1 <- DT[age %inrange% c(5, 25)],
               dt2 <- DT[age %between% c(5, 25)],
               times = 10)


Unit: seconds
                              expr    min     lq   mean median     uq    max neval
 dt1 <- DT[age %inrange% c(5, 25)] 15.122 16.042 17.180 16.969 17.310 22.138    10
 dt2 <- DT[age %between% c(5, 25)] 10.212 11.121 11.675 11.436 12.132 13.913    10
like image 37
Andrew Avatar answered Oct 14 '22 05:10

Andrew


Base R solution:

df <- Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]

df
  x  y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97
like image 34
sm925 Avatar answered Oct 14 '22 04:10

sm925


The above answer is probably more user friendly but here are a couple more...

Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]

  x  y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97

Which can be extended to return other columns, e.g., if you just wanted y:

Mydata[Mydata$x >= 3 & Mydata$x <= 7, 'y']

[1] 45 54 65 78 97

It can also return more than one column,e.g.:

Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156),
                     z = c(5, 4, 3, 2, 1, 0, -1, -2, -3, -4))   

Mydata[Mydata$x >= 3 & Mydata$x <= 7, c('y','z')]

   y  z
3 45  3
4 54  2
5 65  1
6 78  0
7 97 -1
like image 29
antimuon Avatar answered Oct 14 '22 04:10

antimuon