I'm having a hard time understanding why == and %in% produce different results on character vectors when the only difference between the vectors, it seems, is their encoding. An example:
a <- 'Köln'
Encoding(a) <- 'unknown'
Encoding(a)
# [1] "unknown"
b <- a
Encoding(b) <- 'UTF-8'
a == b
# [1] TRUE
a %in% b
# [1] FALSE
Update:
It appears the result is also platform-dependent. The two statements return:
TRUE and FALSE on R 3.3.0 on OS X 10.11.5
FALSE and FALSE on R 3.3.0 on Windows 10 (64 bit)
TRUE and TRUE on R 3.2.3 on CentOS 7
I'm starting to think this is a bug.
It is indeed a bug, and it was fixed in 3.3.1.
The behavior is actually a bit weirder than your example indicates, in that you only get FALSE when you have one element on the left-hand side of %in%:
> a %in% b
[1] FALSE
> c(a, a) %in% b
[1] TRUE TRUE
As implied by the comments, %in% just calls match, so the problem can be seen there too:
> match(a, b)
[1] NA
> match(c(a, a), b)
[1] 1 1
The important arguments to %in% and match are x and table: both functions search for x in table. Under the hood, R does this in the match5 function defined in unique.c. In the case where you have more than one x, match5 will create a hash table from table to enable fast lookups. If you dig through the code, you'll see that the comparison is done in a function called sequal, which returns Seql(STRING_ELT(x, i), STRING_ELT(y, j)) (well, it's actually a bit more complex than this*). Then if you go look at Seql in memory.c, you'll find:
int result = !strcmp(translateCharUTF8(a), translateCharUTF8(b));
Which, as you can see, converts the strings to UTF-8.
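The effect of that translation can be seen from R itself. The following is a small sketch rather than anything taken from the R sources: it uses a latin1-marked string instead of the "unknown" one from the question, so that the byte representations really differ and the result does not depend on the session locale.

```r
# Build "Köln" as latin1 bytes and mark it as such (same idiom as in ?Encoding)
x <- "K\xf6ln"
Encoding(x) <- "latin1"

# The same text, converted to UTF-8
y <- enc2utf8(x)

# The underlying bytes differ: one byte for "ö" in latin1, two bytes in UTF-8
bytes_equal <- identical(charToRaw(x), charToRaw(y))

# == compares after translating both sides to UTF-8, so the values are equal
values_equal <- x == y

print(c(bytes_equal = bytes_equal, values_equal = values_equal))
```

A comparison that skips the translation step is effectively comparing what charToRaw shows, not what == sees.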
However, if x only has one element, it's silly to go through the trouble of creating a hash table, since we can just scan through table once to see if x is there. In 3.3.0, the code to check for equality between x and each element of table didn't use Seql and didn't convert the string to UTF-8. But starting in 3.3.1, Seql is used, so the behavior is fixed.
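If you are stuck on 3.3.0, one possible workaround is to normalize both sides to UTF-8 yourself before matching, so that the single-element path compares like with like. This is only a sketch: match_utf8 is a hypothetical helper name, not part of R, and it assumes your strings are either correctly marked or in the session's native encoding (enc2utf8 treats unmarked strings as native).

```r
# Hypothetical helper: convert both arguments to UTF-8, then call match()
match_utf8 <- function(x, table) {
  match(enc2utf8(x), enc2utf8(table))
}

# "Köln" marked as latin1, and the same text in UTF-8
x <- "K\xf6ln"
Encoding(x) <- "latin1"
y <- enc2utf8(x)

# Single-element lookup, the case that misbehaved in 3.3.0
single <- match_utf8(x, y)

# Multi-element lookup, including a value that is not in the table
multi <- match_utf8(c(x, "Bonn"), y)
```

An %in%-style test can then be written as !is.na(match_utf8(x, y)).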
* A little aside on string equality: R will actually cache strings so that it doesn't have to store a bunch of copies. So if two strings are at the same location, they're equal and there's no need to check further!
> .Internal(inspect("Köln"))
@10321b758 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
@106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"
> .Internal(inspect(b))
@106831cd8 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
@106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"