If I want to select a subset of data in R, I can use the subset function. I wanted to base an analysis on data that matched one of a few criteria, e.g. that a certain variable was either 1, 2 or 3. I tried
myNewDataFrame <- subset(bigfive, subset = (bigfive$bf11==(1||2||3)))
It always selected only the values that matched the first of the criteria, here 1. My assumption was that it would start with 1, and if that evaluated to "false" it would go on to 2 and then to 3; if none of them matched, the statement after == would be "false", and if one of them matched, it would be "true".
I got the right result using
newDataFrame <- subset(bigfive, subset = (bigfive$bf11==c(1,2,3)))
But I would like to be able to select data via logical operators, so: why did the first approach not work?
Multiple conditions can also be combined using the which() function in R. which() returns the positions of the elements that satisfy the given condition. The %in% operator checks, for each value on its left, whether that value occurs in the vector on its right.
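A minimal sketch of both ideas, using a small made-up data frame (the name df and its columns are assumptions for illustration, not part of the original data):

```r
# Hypothetical data, invented for illustration only
df <- data.frame(id = 1:6, score = c(3, 1, 4, 2, 5, 1))

# %in% returns one TRUE/FALSE per element of the left-hand side
keep <- df$score %in% c(1, 2, 3)

# which() turns the logical vector into the positions that are TRUE
rows <- which(keep)

# Index the data frame with those positions
result <- df[rows, ]
print(rows)
print(result)
```

Indexing with the logical vector keep directly would select the same rows here; which() is mainly useful when the positions themselves are needed.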
Subsetting both rows and columns
It is possible to subset both rows and columns using the subset function. The select argument lets you subset variables (columns).
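As a rough illustration, the built-in airquality data set can be filtered by row and column in one call. The choice of month and columns below is arbitrary:

```r
# airquality ships with base R (datasets package)
data(airquality)

# Rows: July only; columns: just Ozone and Temp
july <- subset(airquality, subset = Month == 7, select = c(Ozone, Temp))

head(july)
```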
& and && indicate logical AND, and | and || indicate logical OR. The shorter forms perform elementwise comparisons in much the same way as arithmetic operators. The longer forms evaluate left to right and work on single values: older versions of R examined only the first element of each vector, whereas R 4.3.0 and later signal an error if an operand has length greater than one. Evaluation proceeds only until the result is determined.
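A short sketch of the two behaviours described above. The variable names are made up for illustration, and the stop() calls are there only to show that the right-hand side is never evaluated:

```r
# Elementwise forms: one result per pair of elements
and_vec <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE
or_vec  <- c(TRUE, FALSE, TRUE) | c(FALSE, FALSE, TRUE)  # TRUE FALSE TRUE

# Scalar forms stop as soon as the answer is known, so the
# stop() calls on the right-hand side are never reached
and_short <- FALSE && stop("not reached")  # FALSE
or_short  <- TRUE || stop("not reached")   # TRUE
```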
We can do this by using the if statement. We first assign the variable x, and then write the if condition. In this case, assign -3 to x, and set the if condition to be true if x is smaller than 0 (x < 0). If we run the example code, we indeed see that the string “x is a negative number” gets printed out.
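The example code referred to here is not included in this text. A minimal reconstruction, assuming it follows the description (the variable msg is an addition for illustration), might look like this:

```r
x <- -3

# The condition x < 0 is a single TRUE/FALSE value, as if() requires
if (x < 0) {
  msg <- "x is a negative number"
  print(msg)
}
```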
The correct operator here is %in%. Here is an example with dummy data:
set.seed(1)
dat <- data.frame(bf11 = sample(4, 10, replace = TRUE),
                  foo = runif(10))
giving:
> head(dat)
  bf11       foo
1    2 0.2059746
2    2 0.1765568
3    3 0.6870228
4    4 0.3841037
5    1 0.7698414
6    4 0.4976992
The subset of dat where bf11 equals any of the set 1,2,3 is taken as follows using %in%:
> subset(dat, subset = bf11 %in% c(1,2,3))
   bf11       foo
1     2 0.2059746
2     2 0.1765568
3     3 0.6870228
5     1 0.7698414
8     3 0.9919061
9     3 0.3800352
10    1 0.7774452
As to why your original didn't work, break it down to see the problem. Look at what 1||2||3 evaluates to:
> 1 || 2 || 3
[1] TRUE
and you'd get the same using | instead. As a result, the subset() call compares bf11 with TRUE. In that comparison TRUE is coerced to the number 1, so only rows where bf11 equals 1 are returned, which matches what you saw.
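The effect can be checked directly. In the sketch below a made-up vector stands in for the bf11 column, so the values are an assumption for illustration:

```r
# Hypothetical values standing in for the bf11 column
bf11 <- c(2, 2, 3, 4, 1, 4)

# The right-hand side collapses to a single logical value first
rhs <- (1 || 2 || 3)          # TRUE

# Comparing numbers with TRUE coerces TRUE to 1
cmp_true <- bf11 == rhs
cmp_one  <- bf11 == 1

# Both comparisons flag only the element equal to 1
print(identical(cmp_true, cmp_one))
```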
What you could have written would have been something like:
subset(dat, subset = bf11 == 1 | bf11 == 2 | bf11 == 3)
Which gives the same result as my earlier subset() call. The point is that you need a series of single comparisons, not a comparison of a series of options. But as you can see, %in% is far more useful and less verbose in such circumstances. Notice also that I have to use | as I want to compare each element of bf11 against 1, 2, and 3, in turn. Compare:
> with(dat, bf11 == 1 || bf11 == 2)
[1] TRUE
> with(dat, bf11 == 1 | bf11 == 2)
[1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
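Note that the first result above comes from an older version of R. From R 4.3.0 on, giving || or && an operand longer than one is an error rather than a silent use of the first element. The sketch below is meant to run on either side of that change. It uses a made-up vector in place of dat$bf11 (an assumption for illustration) and any() as one common way to collapse an elementwise result into a single value:

```r
# Hypothetical values standing in for dat$bf11
bf11 <- c(2, 2, 3, 4, 1, 4)

# Wrap the scalar operator so the code runs on old and new R alike:
# old versions return a single logical, new versions raise an error
scalar_or <- tryCatch(
  bf11 == 1 || bf11 == 2,
  error = function(e) "error: operand longer than one"
)
print(scalar_or)

# Elementwise OR, then collapse explicitly if one value is wanted
elementwise <- bf11 == 1 | bf11 == 2
any_match <- any(elementwise)
print(elementwise)
print(any_match)
```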
For your example, I believe the following should work:
myNewDataFrame <- subset(bigfive, subset = bf11 == 1 | bf11 == 2 | bf11 == 3)
See the examples in ?subset for more. Just to demonstrate, a more complicated logical subset would be:
data(airquality)
dat <- subset(airquality, subset = (Temp > 80 & Month > 5) | Ozone < 40)
And as Chase points out, %in% would be more efficient in your example:
myNewDataFrame <- subset(bigfive, subset = bf11 %in% c(1, 2, 3))
As Chase also points out, make sure you understand the difference between | and ||. To see help pages for operators, use ?'||', where the operator is quoted.
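One caution about the bigfive$bf11 == c(1,2,3) form mentioned in the question: == does not test set membership. R recycles the shorter vector along the longer one, so each element is compared with just one of 1, 2 or 3, depending on its position. It may appear to work on some data, but a small sketch with made-up values shows how it differs from %in%:

```r
# Hypothetical values, chosen so that recycling is visible
v <- c(1, 1, 1, 2, 2, 2)

# c(1, 2, 3) is recycled to 1 2 3 1 2 3 and compared position by position
by_equals <- v == c(1, 2, 3)   # TRUE FALSE FALSE FALSE TRUE FALSE

# %in% asks, for every element, whether it occurs anywhere in the set
by_in <- v %in% c(1, 2, 3)     # TRUE for all six elements

print(by_equals)
print(by_in)
```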