
== and %in% differ based on character encoding?

I'm having a hard time understanding why == and %in% produce different results on character vectors when the only difference between the vectors, it seems, is their declared encoding. An example:

a <- 'Köln'
Encoding(a) <- 'unknown'
Encoding(a)
# [1] "unknown"

b <- a
Encoding(b) <- 'UTF-8'

a == b
# [1] TRUE
a %in% b
# [1] FALSE

Update:

It appears the result is also platform-dependent. The two statements return:

  • TRUE and FALSE on R 3.3.0 on OS X 10.11.5
  • FALSE and FALSE on R 3.3.0 on Windows 10 (64 bit)
  • TRUE and TRUE on R 3.2.3 on CentOS 7

I'm starting to think this is a bug.

Asked by RoyalTS on Jun 08 '16 at 20:06



1 Answer

It is indeed a bug, and it was fixed in 3.3.1.

The behavior is actually a bit weirder than your example indicates, in that you only get FALSE when you have one element on the left-hand side of %in%:

> a %in% b
[1] FALSE
> c(a, a) %in% b
[1] TRUE TRUE

As implied by the comments, %in% just calls match, so the problem can be seen there too:

> match(a, b)
[1] NA
> match(c(a, a), b)
[1] 1 1
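
To make the relationship between %in% and match concrete, here is a minimal sketch. The name in_sketch is hypothetical; its body mirrors the definition given in the ?match help page, so treat it as an illustration rather than the actual source.

```r
# Hypothetical helper: %in% expressed through match(), as described in ?match.
in_sketch <- function(x, table) match(x, table, nomatch = 0L) > 0L

x   <- c("a", "z")
tbl <- c("a", "b")

in_sketch(x, tbl)   # TRUE FALSE
x %in% tbl          # TRUE FALSE
```

Because %in% is only a thin wrapper, any oddity in match shows up in %in% as well.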

Both %in% and match take the arguments x and table, and both search for the elements of x in table. Under the hood, R does this in the match5 function defined in unique.c. When x has more than one element, match5 builds a hash table from table to enable fast lookups. If you dig through the code, you'll see that the comparison is done in a function called sequal, which returns Seql(STRING_ELT(x, i), STRING_ELT(y, j)) (well, it's actually a bit more complex than this*). Then if you go look at Seql in memory.c, you'll find:

int result = !strcmp(translateCharUTF8(a), translateCharUTF8(b));

As you can see, this converts both strings to UTF-8 before comparing them.
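
A rough R-level analogue of that comparison, as a sketch only: seql_sketch is a hypothetical name, and enc2utf8 stands in for translateCharUTF8. I'm also assuming a latin1-marked copy of the string instead of the "unknown" one from the question, so the result doesn't depend on your locale.

```r
x <- "K\u00f6ln"                                # marked as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")    # same text, marked as latin1

# The underlying bytes differ between the two encodings...
identical(charToRaw(x), charToRaw(y))   # FALSE

# ...but they agree once both sides are translated to UTF-8.
# Hypothetical helper, written for length-one character vectors only.
seql_sketch <- function(a, b) {
  identical(charToRaw(enc2utf8(a)), charToRaw(enc2utf8(b)))
}

seql_sketch(x, y)   # TRUE
x == y              # TRUE
```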

However, if x only has one element, it's silly to go through the trouble of creating a hash table, since we can just scan through table once to see if x is there. In 3.3.0, the code to check for equality between x and each element of table didn't use Seql and didn't convert the string to UTF-8. But starting in 3.3.1, Seql is used, so the behavior is fixed.
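
The single-element path can be pictured as a plain linear scan. The sketch below is my own R-level illustration of the fixed behavior, not the C code from unique.c: scan_match_sketch is a hypothetical name, enc2utf8 stands in for the UTF-8 translation done by Seql, and the latin1-marked string is an assumption chosen to keep the example independent of locale.

```r
# Hypothetical helper: look up a single x by scanning table once,
# comparing after translating both sides to UTF-8.
scan_match_sketch <- function(x, table) {
  stopifnot(length(x) == 1L)
  x_utf8 <- enc2utf8(x)
  for (j in seq_along(table)) {
    # isTRUE() guards against NA entries in table
    if (isTRUE(x_utf8 == enc2utf8(table[j]))) return(j)
  }
  NA_integer_
}

x   <- "K\u00f6ln"                               # marked as UTF-8
y   <- iconv(x, from = "UTF-8", to = "latin1")   # same text, marked as latin1
tbl <- c("Bonn", y)

scan_match_sketch(x, tbl)   # 2
match(x, tbl)               # 2 on R >= 3.3.1
match(c(x, x), tbl)         # 2 2
```

On a fixed version of R, the single-element and multi-element lookups agree with each other and with the scan.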

* A little aside on string equality: R will actually cache strings so that it doesn't have to store a bunch of copies. So if two strings are at the same location, they're equal and there's no need to check further!

> .Internal(inspect("Köln"))
@10321b758 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
  @106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"
> .Internal(inspect(b))
@106831cd8 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
  @106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"

Answered by Peyton on Sep 22 '22 at 23:09