
An explanation on the behaviour of the "==" operator

Tags:

r

In the following very simple example I cannot understand the behavior of the "==" operator.

A <- c(10, 20, 10, 10, 20, 30)
B <- c(40, 50, 60, 70, 80, 90)

df <- data.frame(A, B)

df[df$A == c(10,20), ]      # it returns 3 rows instead of 5
df[df$A %in% c(10,20), ]    # it works properly and returns 5 rows

Thank you in advance all of you.

Apostolos asked Jun 30 '15 22:06


2 Answers

To understand what is going on, you have to understand the structure of a data frame and R's recycling rules. A data frame is simply a list of vectors of equal length, one per column.

> unclass(df)
$A
[1] 10 20 10 10 20 30

$B
[1] 40 50 60 70 80 90

attr(,"row.names")
[1] 1 2 3 4 5 6

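If you compare two vectors of different lengths with == in R, the comparison is element-wise and the shorter vector is recycled (repeated) until it matches the length of the longer one. In your case df$A == c(10,20) is equivalent to: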

> c(10, 20, 10, 10, 20, 30) == c(10, 20, 10, 20, 10, 20)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

and

> df[c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE), ]
   A  B
1 10 40
2 20 50
3 10 60
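
The recycling step can be spelled out by hand. This is only a minimal sketch using base R's rep_len(); the names recycled, implicit and explicit are illustrative and not part of the question.

```r
A <- c(10, 20, 10, 10, 20, 30)

# Repeat the shorter vector until it is as long as A,
# which is what == does implicitly
recycled <- rep_len(c(10, 20), length(A))
recycled                        # 10 20 10 20 10 20

# The implicit and the hand-recycled comparison agree
implicit <- A == c(10, 20)
explicit <- A == recycled
identical(implicit, explicit)   # TRUE
```

So == never asks "is this value one of 10 or 20?"; it only compares values position by position.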

From the %in% documentation:

%in% returns a logical vector indicating if there is a match or not for its left operand

> df$A %in% c(10,20)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

and

> df[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE), ]
   A  B
1 10 40
2 20 50
3 10 60
4 10 70
5 20 80
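
As a rough illustration of what %in% does, the membership test can also be written as explicit comparisons joined with |, or with match(), which the %in% documentation gives as its definition. The names by_in, by_or and by_match are made up for this sketch; the data is taken from the question.

```r
A <- c(10, 20, 10, 10, 20, 30)
B <- c(40, 50, 60, 70, 80, 90)
df <- data.frame(A, B)

# Each element of A is tested against the whole set c(10, 20)
by_in <- df$A %in% c(10, 20)

# The same test written as element-wise comparisons joined with |
by_or <- df$A == 10 | df$A == 20

# The same test written with match(), as in the %in% documentation
by_match <- match(df$A, c(10, 20), nomatch = 0) > 0

identical(by_in, by_or)      # TRUE
identical(by_in, by_match)   # TRUE
nrow(df[by_in, ])            # 5
```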
zero323 answered Nov 12 '22 03:11


Here is my answer, which I hope adds some insight to the other (very good) answers. As stated in "The Art of R Programming" by Norman Matloff:

When applying an operation to two vectors that requires them to be the same length, R automatically recycles, or repeats, the shorter one, until it is long enough to match the longer one

If the concept is still not clear, take a look at this and try to guess the output:

c(10, 10, 10, 10, 10, 10) == c(10, 20)

which will give:

[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

because R recycles the shorter vector. It compares the first 10 on the left with the 10 on the right (TRUE), then the second 10 on the left with the 20 on the right (FALSE). At that point the shorter vector (the one on the right) is used up, so R starts again from its first element and the pattern repeats.
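
One related detail: recycling is silent only when the longer length is a multiple of the shorter one, as it is in the examples here (6 and 2). Otherwise R still recycles but issues a warning. A small sketch, with res and out as illustrative names:

```r
# 3 is not a multiple of 2, so the recycled vector is cut off mid-cycle
# and R warns; tryCatch() captures the warning message as a string
res <- tryCatch(
  c(10, 10, 10) == c(10, 20),
  warning = function(w) conditionMessage(w)
)
res

# The comparison itself is still carried out against c(10, 20, 10)
out <- suppressWarnings(c(10, 10, 10) == c(10, 20))
out   # TRUE FALSE TRUE
```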

SabDeM answered Nov 12 '22 03:11