
== and %in% differ based on character encoding?

Tags:

r

character-encoding

I'm having a hard time understanding why == and %in% produce different results on character vectors when the only difference between the vectors seems to be their declared encoding. An example:

a <- 'Köln'
Encoding(a) <- 'unknown'
Encoding(a)
# [1] "unknown"

b <- a
Encoding(b) <- 'UTF-8'

a == b
# [1] TRUE
a %in% b
# [1] FALSE

Update:

It appears the result is also platform-dependent. The two statements return:

TRUE and FALSE on R 3.3.0 on OS X 10.11.5
FALSE and FALSE on R 3.3.0 on Windows 10 (64 bit)
TRUE and TRUE on R 3.2.3 on CentOS 7

I'm starting to think this is a bug.


asked Jun 08 '16 20:06

RoyalTS

1 Answer

It is indeed a bug, and it was fixed in 3.3.1.

The behavior is actually a bit weirder than your example indicates, in that you only get FALSE when you have one element on the left-hand side of %in%:

> a %in% b
[1] FALSE
> c(a, a) %in% b
[1] TRUE TRUE

As implied by the comments, %in% just calls match, so the problem can be seen there too:

> match(a, b)
[1] NA
> match(c(a, a), b)
[1] 1 1
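To make the relationship concrete, here is a minimal sketch. my_in is a made-up name used only for this illustration; its body mirrors the way base R defines %in% in terms of match, and the vectors are arbitrary ASCII examples rather than the strings from the question.

```r
# my_in is a hypothetical name; the body mirrors base R's definition of %in%.
my_in <- function(x, table) match(x, table, nomatch = 0L) > 0L

x   <- c("a", "z")
tbl <- c("a", "b", "c")

match(x, tbl)   # position of each element of x in tbl, NA where absent
x %in% tbl      # TRUE where match() found a position
my_in(x, tbl)   # same result as the line above
```

So anything that makes match return NA will make %in% return FALSE.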

The important arguments to %in% and match are x and table: both functions search for x in table. Under the hood, R does this in the match5 function defined in unique.c. When x has more than one element, match5 builds a hash table from table to enable fast lookups. If you dig through the code, you'll see that the comparison is done in a function called sequal, which returns Seql(STRING_ELT(x, i), STRING_ELT(y, j)) (well, it's actually a bit more complex than this*). If you then look at Seql in memory.c, you'll find:

int result = !strcmp(translateCharUTF8(a), translateCharUTF8(b));

As you can see, this converts both strings to UTF-8 before comparing them.
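A rough R-level analogue of that comparison, as a sketch only. eq_bytes and eq_utf8 are hypothetical helpers, not part of R, and the strings are built from raw bytes as latin1 and UTF-8 (rather than the "unknown" encoding in the question) so that the result doesn't depend on the session's locale.

```r
# The word from the question, stored as latin1 bytes and as UTF-8 bytes.
latin1 <- rawToChar(as.raw(c(0x4b, 0xf6, 0x6c, 0x6e)))
Encoding(latin1) <- "latin1"
utf8 <- rawToChar(as.raw(c(0x4b, 0xc3, 0xb6, 0x6c, 0x6e)))
Encoding(utf8) <- "UTF-8"

# Hypothetical helpers: compare the stored bytes directly, or only after
# converting both sides to UTF-8 (roughly what the strcmp line does).
eq_bytes <- function(a, b) identical(charToRaw(a), charToRaw(b))
eq_utf8  <- function(a, b) eq_bytes(enc2utf8(a), enc2utf8(b))

eq_bytes(latin1, utf8)  # FALSE: the stored bytes differ
eq_utf8(latin1, utf8)   # TRUE: identical once both are UTF-8
latin1 == utf8          # TRUE
```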

However, if x only has one element, it's silly to go through the trouble of creating a hash table, since we can just scan through table once to see if x is there. In 3.3.0, the code that checked for equality between x and each element of table didn't use Seql and didn't convert the strings to UTF-8. Starting in 3.3.1, Seql is used on this path too, so the behavior is fixed.
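Here is an R-level sketch of that single-element scan with the comparison made pluggable. This is not R's C code: scan_match, eq_bytes and eq_utf8 are hypothetical helpers, and comparing stored bytes is just my stand-in for "a comparison that skips the UTF-8 conversion", not a transcription of what 3.3.0 did. It uses latin1 versus UTF-8 strings built from raw bytes so it runs the same in any locale.

```r
# The word from the question, stored as latin1 bytes and as UTF-8 bytes.
latin1 <- rawToChar(as.raw(c(0x4b, 0xf6, 0x6c, 0x6e)))
Encoding(latin1) <- "latin1"
utf8 <- rawToChar(as.raw(c(0x4b, 0xc3, 0xb6, 0x6c, 0x6e)))
Encoding(utf8) <- "UTF-8"

# Hypothetical comparisons: without and with conversion to UTF-8.
eq_bytes <- function(a, b) identical(charToRaw(a), charToRaw(b))
eq_utf8  <- function(a, b) eq_bytes(enc2utf8(a), enc2utf8(b))

# Linear scan: index of the first element of table equal to x under eq,
# or NA if there is none. No hash table is built.
scan_match <- function(x, table, eq) {
  for (j in seq_along(table)) {
    if (eq(x, table[[j]])) return(j)
  }
  NA_integer_
}

scan_match(latin1, utf8, eq_bytes)  # NA: no conversion, so no match
scan_match(latin1, utf8, eq_utf8)   # 1: conversion first, so it matches

# Converting both sides explicitly before calling match() makes the lookup
# independent of which comparison the scan uses.
match(enc2utf8(latin1), enc2utf8(utf8))
```

If you're stuck on an affected version, calling enc2utf8 on both arguments like this should be a reasonable workaround, though I haven't verified it on 3.3.0 itself.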

* A little aside on string equality: R will actually cache strings so that it doesn't have to store a bunch of copies. So if two strings are at the same location, they're equal and there's no need to check further!

> .Internal(inspect("Köln"))
@10321b758 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
  @106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"
> .Internal(inspect(b))
@106831cd8 16 STRSXP g1c1 [MARK,NAM(2)] (len=1, tl=0)
  @106831eb8 09 CHARSXP g1c1 [MARK,gp=0x28,ATT] [UTF8] [cached] "Köln"
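If you want to check the sharing programmatically rather than by eye, something like the following sketch may work. It assumes the output format of .Internal(inspect()) shown above, which is an internal debugging aid and not a stable API, and charsxp_address is a hypothetical helper. It uses a plain ASCII string so the result doesn't depend on the locale.

```r
# Hypothetical helper: pull the address of the CHARSXP behind x[1] out of
# the text printed by .Internal(inspect()). The format may change.
charsxp_address <- function(x) {
  out  <- capture.output(.Internal(inspect(x)))
  line <- grep("CHARSXP", out, value = TRUE)[1]
  sub("^\\s*@(\\S+)\\s.*$", "\\1", line)
}

s1 <- "Koln"
s2 <- paste0("K", "oln")   # built at run time, same bytes as s1

# Both vectors should point at the same cached CHARSXP.
charsxp_address(s1) == charsxp_address(s2)
```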


answered Sep 22 '22 23:09

Peyton
