Let's say I have the data frame Mydata as shown below:
Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))
I want to filter this data frame and create another data frame, so that only the values of x
between 3
and 7
and their corresponding y
values are shown. I attempted the following:
new_frame <- Mydata %>% filter(x == (3:7))
This didn't work. How then would I filter for a specified range?
Thanks in advance for all help
In this, first, pass your dataframe object to the filter function, then in the condition parameter write the column name in which you want to filter multiple values then put the %in% operator, and then pass a vector containing all the string values which you want in the result.
You can use %in%
, or as has been mentioned, alternatively dplyr
s between()
:
library(dplyr)
new_frame <- Mydata %>% filter(x %in% (3:7) )
new_frame
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
While %in%
works great for integers (or other equally spaced sequences), if you need to filter on floats, or any value between and including your two end points, or just want an alternative that's a bit more explicit than %in%
, use dplyr
's between()
:
new_frame2 <- Mydata%>% filter( between(x, 3, 7) )
new_frame2
# x y
# 1 3 45
# 2 4 54
# 3 5 65
# 4 6 78
# 5 7 97
To further clarify, note that %in%
checks for the presence in a set of values:
3 %in% 3:7
# [1] TRUE
5 %in% 3:7
# [1] TRUE
5.0 %in% 3:7
# [1] TRUE
The above return TRUE
because 3:7
is shorthand for seq(3, 7)
which produces:
3:7
# [1] 3 4 5 6 7
seq(3, 7)
# [1] 3 4 5 6 7
As such, if you were to use %in%
to check for values not produced by :
, it will return FALSE
:
4.5 %in% 3:7
# [1] FALSE
4.15 %in% 3:7
# [1] FALSE
Whereas between
checks against the end points and all values in between:
between(3, 3, 7)
# [1] TRUE
between(7, 3, 7)
# [1] TRUE
between(5, 3, 7)
# [1] TRUE
between(5.0, 3, 7)
# [1] TRUE
between(4.5, 3, 7)
# [1] TRUE
between(4.15, 3, 7)
# [1] TRUE
Plenty of good dplyr solutions such as filtering in or hard-coding the upper and lower bounds already present in some of the answers:
MydataTable%>% filter(between(x, 3, 70))
Mydata %>% filter(x %in% 3:7)
Mydata %>% filter(x>=3&x<=7)
You could also work with data.table, which is very fast for large data sets. inrange
and between
work identically for this purpose
library(data.table)
MydataTable <- data.table(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156))
MydataTable[x %inrange% c(3,7)]
MydataTable[x %between% c(3,7)]
A benefit of this method (besides the speed of data.table) is that you only need to specify the min and max range - you are not creating an array to subset the filter.
A time comparison of these methods:
> df <- data.frame(x = sample(1:10, 10000000, replace = T),
+ y = sample(1:10, 10000000, replace = T))
> system.time({ df %>% filter(between(x, 3, 7)) })
user system elapsed
0.18 0.05 0.14
> system.time({ df %>% filter(x %in% 3:7) })
user system elapsed
0.19 0.06 0.29
> system.time({ df %>% filter(x>=3&x<=7) })
user system elapsed
0.17 0.09 0.26
> dt <- data.table(df)
> system.time( {dt[x %inrange% c(3,7)] })
user system elapsed
0.13 0.07 0.21
> system.time( {dt[x %between% c(3,7)] })
user system elapsed
0.18 0.05 0.13
And the good old base::subset
:
subset(Mydata, x >= 3 & x <= 7)
subset(Mydata, x %in% 3:7)
Piggybacking on @Anna 's answer, I just ran a few of the options to see which was faster on a larger dataset for a problem have at work. I used the setup from here (Faster way to subset on rows of a data frame in R?) I checked it on a 1 billion row (16gb) dataset. Looks like data.table edged out dplyr by a little bit. I am just starting to use data.table though so I may have not used the most efficient code. Oh, also, I narrowed it down to these 4 based on times from a 100million row dataset. See below:
set.seed(42)
# 1 billion rows
df <- data.frame(age=sample(1:65,1e9,replace=TRUE),x=rnorm(1e9),y=rpois(1e9,25))
microbenchmark(df1 <- df %>% filter(age >= 5 & age <= 25),
df2 <- df %>% filter(dplyr::between(df$age, 5, 25)),
times=10)
Unit: seconds
expr min lq mean median uq max neval
df %>% filter(age >= 5 & age <= 25) 15.327 15.796 16.526 16.601 17.086 17.996 10
df %>% filter(dplyr::between(df$age, 5, 25)) 14.214 14.752 15.413 15.487 16.121 16.447 10
DT <- as.data.table(df)
microbenchmark(dt1 <- DT[age %inrange% c(5, 25)],
dt2 <- DT[age %between% c(5, 25)],
times = 10)
Unit: seconds
expr min lq mean median uq max neval
dt1 <- DT[age %inrange% c(5, 25)] 15.122 16.042 17.180 16.969 17.310 22.138 10
dt2 <- DT[age %between% c(5, 25)] 10.212 11.121 11.675 11.436 12.132 13.913 10
Base R solution:
df <- Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]
df
x y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97
The above answer is probably more user friendly but here are a couple more...
Mydata[Mydata$x >= 3 & Mydata$x <= 7, ]
x y
3 3 45
4 4 54
5 5 65
6 6 78
7 7 97
Which can be extended to return other columns, e.g., if you just wanted y:
Mydata[Mydata$x >= 3 & Mydata$x <= 7, 'y']
[1] 45 54 65 78 97
It can also return more than one column,e.g.:
Mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
y = c(20, 30, 45, 54, 65, 78, 97, 102, 123, 156),
z = c(5, 4, 3, 2, 1, 0, -1, -2, -3, -4))
Mydata[Mydata$x >= 3 & Mydata$x <= 7, c('y','z')]
y z
3 45 3
4 54 2
5 65 1
6 78 0
7 97 -1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With