I'm having a hard time understanding why == and %in% would produce different results when applied to character vectors, where the difference depends, it seems, only on the vectors' encoding. An example:
a <- 'Köln'
Encoding(a) <- 'unknown'
Encoding(a)
# [1] "unknown"
b <- a
Encoding(b) <- 'UTF-8'
a == b
# [1] TRUE
a %in% b
# [1] FALSE
Update:
It appears the result is also platform-dependent. The two statements return:
TRUE and FALSE on R 3.3.0 on OS X 10.11.5
FALSE and FALSE on R 3.3.0 on Windows 10 (64 bit)
TRUE and TRUE on R 3.2.3 on CentOS 7
I'm starting to think this is a bug.
It is indeed a bug, and it was fixed in 3.3.1.
The behavior is actually a bit weirder than your example indicates, in that you only get FALSE when you have one element on the left-hand side of %in%:
> a %in% b
[1] FALSE
> c(a, a) %in% b
[1] TRUE TRUE
As implied by the comments, %in% just calls match, so the problem can be seen there too:
> match(a, b)
[1] NA
> match(c(a, a), b)
[1] 1 1
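To see the relationship between %in% and match for yourself, you can print the function. The sketch below also writes out an equivalent by hand; in_sketch is a hypothetical name used only for illustration, not something base R provides.

```r
# Printing the function shows that it is defined in terms of match().
print(`%in%`)

# A hand-written equivalent (in_sketch is a made-up name for this example).
# match() returns 0 for "not found" here, so "> 0L" turns positions into TRUE/FALSE.
in_sketch <- function(x, table) match(x, table, nomatch = 0L) > 0L

in_sketch(c("a", "z"), c("a", "b"))
# [1]  TRUE FALSE
```

Because the two are this closely tied, anything odd in match shows up in %in% as well.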
The important arguments to %in% and match are x and table, where either function searches for x in table. Under the hood, R does this in the match5 function defined in unique.c. In the case where you have more than one x, match5 will create a hash table from table to enable fast lookups. If you dig through the code, you'll see that the comparison is done in a function called sequal, which returns Seql(STRING_ELT(x, i), STRING_ELT(y, j)) (well, it's actually a bit more complex than this*). Then if you go look at Seql in memory.c, you'll find:
int result = !strcmp(translateCharUTF8(a), translateCharUTF8(b));
Which, as you can see, converts the strings to UTF-8.
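The effect of that translation can be approximated from R itself. This is only a rough sketch: enc2utf8 stands in for the C-level translateCharUTF8, and the names x_utf8 and x_latin1 are made up for the example. The string is built with an escape so the result does not depend on the encoding of the source file.

```r
# The same word in two encodings.
x_utf8 <- "K\u00f6ln"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# The declared encodings differ ...
Encoding(x_utf8)
Encoding(x_latin1)

# ... and so do the underlying bytes.
charToRaw(x_utf8)
charToRaw(x_latin1)

# Once both are translated to UTF-8, the bytes agree,
# which is the comparison that Seql effectively performs.
same_after_translation <- identical(
  charToRaw(enc2utf8(x_utf8)),
  charToRaw(enc2utf8(x_latin1))
)
same_after_translation
```

On affected versions, wrapping both sides of %in% in enc2utf8 is one plausible way to sidestep the problem, though I have only reasoned about it rather than tested it on 3.3.0.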
However, if x only has one element, it's silly to go through the trouble of creating a hash table, since we can just scan through table once to see if x is there. In 3.3.0, the code to check for equality between x and each element of table didn't use Seql and didn't convert the strings to UTF-8. But starting in 3.3.1, Seql is used, so the behavior is fixed.
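The two code paths can be sketched in plain R. To be clear, this is not how unique.c is written: match_sketch is a hypothetical function, an environment stands in for the C hash table, and enc2utf8 stands in for the translation done inside Seql. It also ignores NA and empty strings, which an environment cannot use as keys.

```r
# Hypothetical illustration only; the real logic lives in C in unique.c.
match_sketch <- function(x, table) {
  # Normalise both sides to UTF-8 before any comparison.
  x_utf8 <- enc2utf8(x)
  table_utf8 <- enc2utf8(table)

  if (length(x) == 1L) {
    # Single lookup: a linear scan avoids the cost of building a hash table.
    for (j in seq_along(table_utf8)) {
      if (identical(x_utf8, table_utf8[j])) return(j)
    }
    return(NA_integer_)
  }

  # Several lookups: build the hash table once,
  # keeping the first position of each distinct value.
  positions <- new.env(hash = TRUE)
  for (j in seq_along(table_utf8)) {
    key <- table_utf8[j]
    if (is.null(positions[[key]])) positions[[key]] <- j
  }
  vapply(x_utf8, function(key) {
    pos <- positions[[key]]
    if (is.null(pos)) NA_integer_ else pos
  }, integer(1), USE.NAMES = FALSE)
}

match_sketch("b", c("a", "b"))          # single-element path
match_sketch(c("b", "z"), c("a", "b"))  # hash-table path
```

The point of the sketch is that both branches compare the same normalised strings. The 3.3.0 bug amounts to the single-element branch skipping that normalisation.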
* A little aside on string equality: R will actually cache strings so that it doesn't have to store a bunch of copies. So if two strings are at the same location, they're equal and there's no need to check further!
> .Internal(inspect("Köln"))
@10321b758 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
@106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"
> .Internal(inspect(b))
@106831cd8 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
@106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"