I've written the following function based on subset(), which I find handy:
    ss <- function (x, subset, ...)
    {
        r <- eval(substitute(subset), data.frame(.=x), parent.frame())
        if (!is.logical(r))
            stop("'subset' must be logical")
        x[r & !is.na(r)]
    }
So, I can write:
    ss(myDataFrame$MyVariableName, 500 < . & . < 1500)
instead of
    myDataFrame$MyVariableName[500 < myDataFrame$MyVariableName
                               & myDataFrame$MyVariableName < 1500]
This seems like something other people might have developed solutions for, though - including something in core R I might have missed. Anything already out there?
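On the "something in core R" part: base R's subset() has a default method for plain vectors, so it may already cover this case, at the cost of naming the vector twice. Here is a minimal sketch comparing it with ss() and plain bracket indexing. The toy vector v is made up for illustration; the point to watch is how each form treats NA.

```r
# ss() as defined in the question
ss <- function(x, subset, ...) {
  r <- eval(substitute(subset), data.frame(. = x), parent.frame())
  if (!is.logical(r)) stop("'subset' must be logical")
  x[r & !is.na(r)]
}

# Toy vector (hypothetical data) with one missing value
v <- c(250, 800, NA, 1200, 1700)

via_ss      <- ss(v, 500 < . & . < 1500)      # 800 1200
via_subset  <- subset(v, 500 < v & v < 1500)  # 800 1200
via_bracket <- v[500 < v & v < 1500]          # 800 NA 1200

print(via_ss)
print(via_subset)
print(via_bracket)
```

Both ss() and subset() drop the element where the condition evaluates to NA, whereas plain [ returns an NA in that position.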
The way you tell R that you want to select some particular elements (i.e., a 'subset') from a vector is by placing an 'index vector' in square brackets immediately following the name of the vector. For a simple example, try x[1:10] to view the first ten elements of x.
Subsetting in R is an indexing feature for accessing the elements of an object. It can be used to select and filter variables and observations. You can use brackets to select rows and columns from a data frame.
The difference between subset() and sample() is that subset() selects the data that meet a given condition, while sample() randomly selects 'n' items from the data.
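The ideas above (index vectors, bracket selection on a data frame, and subset() versus sample()) can be sketched with the built-in mtcars data. The seed value is arbitrary and is only there to make the random draw repeatable; the variable names are made up for this example.

```r
set.seed(42)  # arbitrary seed, only for repeatability

hp <- mtcars$hp

# Index vector of positions: the first ten elements
first_ten <- hp[1:10]

# Brackets on a data frame: rows by condition, columns by name
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]

# subset(): keep the rows that meet a condition
in_range <- subset(mtcars, hp > 100 & hp < 180)

# sample(): draw 5 values at random, no condition involved
random5 <- sample(hp, 5)

print(first_ten)
print(head(four_cyl))
print(nrow(in_range))
print(random5)
```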
I realize that the solution Ken offers is more general than just selecting items within ranges (since it should work on any logical expression), but this did remind me that Greg Snow has comparison infix operators in his TeachingDemos package:
    library(TeachingDemos)
    x0 <- rnorm(100)
    x0[ 0 %<% x0 %<% 1.5 ]
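For anyone curious how a chained comparison like this can work at all: user-defined %op% operators are evaluated left to right, so the first comparison has to pass the middle operand along somehow. Below is a rough base R sketch of that idea. The operator name %lt% and the "carried" attribute are made up here, and this is not claimed to be how TeachingDemos actually implements %<%.

```r
# Hypothetical chaining "less than" operator (illustration only).
# The result is a logical vector that remembers the right-hand operand
# in an attribute, so a second %lt% can continue the comparison.
`%lt%` <- function(lhs, rhs) {
  carried <- attr(lhs, "carried")
  ok <- if (is.null(carried)) {
    lhs < rhs                          # first link in the chain
  } else {
    as.vector(lhs) & (carried < rhs)   # later link: combine with earlier result
  }
  attr(ok, "carried") <- rhs
  ok
}

set.seed(1)  # arbitrary seed, only for repeatability
x0 <- rnorm(100)

chained <- x0[0 %lt% x0 %lt% 1.5]
plain   <- x0[0 < x0 & x0 < 1.5]

print(identical(chained, plain))
```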
Thanks for sharing Ken.
You could use:
    x <- myDataFrame$MyVariableName; x[x > 100 & x < 180]
Yours may require less typing, but the code is less portable if you're sharing it with others. I have a few time-saver functions like that myself, but I use them sparingly because they can slow down your code (extra steps) and you have to include the function's definition whenever you share the file with someone else.
Compare the amount of typing. The two are almost the same length:
    ss(mtcars$hp, 100 < . & . < 180)
    x <- mtcars$hp; x[x > 100 & x < 180]
Compare time on 1000 replications.
    library(rbenchmark)
    benchmark(
        tyler = x[x > 100 & x < 180],
        ken = ss(mtcars$hp, 100 < . & . < 180),
        replications = 1000)

       test replications elapsed relative user.self sys.self user.child sys.child
    2   ken         1000    0.56 18.66667      0.36     0.03         NA        NA
    1 tyler         1000    0.03  1.00000      0.03     0.00         NA        NA
So I guess it depends on whether you need speed and/or shareability versus convenience. If it's just for you on a small data set, I'd say it's valuable.
EDIT: NEW BENCHMARKING
    > benchmark(
    +     tyler = {x <- mtcars$hp; x[x > 100 & x < 180]},
    +     ken = ss(mtcars$hp, 100 < . & . < 180),
    +     ken2 = ss2(mtcars$hp, 100 < . & . < 180),
    +     joran = with(mtcars, hp[hp > 100 & hp < 180]),
    +     replications = 10000)
       test replications elapsed  relative user.self sys.self user.child sys.child
    4 joran        10000    0.83  2.677419      0.69     0.00         NA        NA
    2   ken        10000    3.79 12.225806      3.45     0.02         NA        NA
    3  ken2        10000    0.67  2.161290      0.35     0.00         NA        NA
    1 tyler        10000    0.31  1.000000      0.20     0.00         NA        NA
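The definition of ss2 isn't shown in this thread, so the following is not a reconstruction of it. It is only a sketch of one plausible way to trim the overhead of ss(), assuming the data.frame() call is a large part of the cost: data.frame() does a fair amount of checking and conversion that a plain list() skips, and eval() accepts a list as its evaluation environment. The name ss_list is hypothetical, and no timings are claimed here.

```r
# ss() as defined in the question
ss <- function(x, subset, ...) {
  r <- eval(substitute(subset), data.frame(. = x), parent.frame())
  if (!is.logical(r)) stop("'subset' must be logical")
  x[r & !is.na(r)]
}

# Hypothetical variant: same logic, but '.' is supplied via a plain list
ss_list <- function(x, subset, ...) {
  r <- eval(substitute(subset), list(. = x), parent.frame())
  if (!is.logical(r)) stop("'subset' must be logical")
  x[r & !is.na(r)]
}

hp <- mtcars$hp
same_result <- identical(ss_list(hp, 100 < . & . < 180),
                         ss(hp, 100 < . & . < 180))
print(same_result)
```

If you try this, it would be worth adding it to the benchmark() call above to see whether it makes any difference on your machine.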