
internal string caching in R

This question stems from data.table bug report #4978, but I'm going to use a data.frame example to show that the issue is not specific to data.table:

Consider the following:

df = data.frame(a = 1, hø = 1)

identical(names(df), c("a", "hø"))
#[1] TRUE

.Internal(inspect(names(df)))
#@0x0000000007b27458 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
#  @0x000000000ee604c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
#  @0x0000000007cfa910 09 CHARSXP g0c1 [gp=0x21] [cached] "hø"

.Internal(inspect(c("a", "hø")))
#@0x0000000007b274c8 16 STRSXP g0c2 [] (len=2, tl=0)
#  @0x000000000ee604c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
#  @0x0000000007cfa970 09 CHARSXP g0c1 [gp=0x24,ATT] [latin1] [cached] "hø"

Notice that even though identical() considers the two vectors identical, the underlying string cache stores "hø" at two different addresses (with different encoding flags), while "a" is stored only once. What is happening? Is this an R string-caching bug?

This matters because %chin% fails here as a result of the above discrepancy:

library(data.table)
"a" %chin% names(df)
#[1] TRUE
"hø" %chin% names(df)
#[1] FALSE
eddi asked Oct 08 '13 20:10


1 Answer

"hø" is marked as UTF-8 encoded when it is typed directly at the console. You can force it to the native encoding with enc2native, and the problem disappears. I am still working out why this is...

Encoding("hø")
# [1] "UTF-8"

.Internal( inspect( c( "a" , enc2native("hø") ) ) )
#@1081d60a0 16 STRSXP g0c2 [] (len=2, tl=0)
#  @100af87d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
#  @1081e3a08 09 CHARSXP g1c1 [MARK,gp=0x21] [cached] "hø"

enc2native("hø") %chin% names(df)
#[1] TRUE
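
To see why two strings that compare as identical can still occupy separate cache entries, here is a minimal base-R sketch. It makes some assumptions: the latin1 copy built with iconv stands in for the console-typed string from the question, the variable names are made up for illustration, and enc2utf8 is used for normalisation instead of enc2native so that the result does not depend on the session locale.

```r
# Two copies of "hø": one marked UTF-8, one marked latin1.
utf8_str   <- "h\u00f8"                                       # \u escape gives a UTF-8 mark
latin1_str <- iconv(utf8_str, from = "UTF-8", to = "latin1")  # same text, latin1 bytes

Encoding(c(utf8_str, latin1_str))
# [1] "UTF-8"  "latin1"

# identical() translates before comparing, so it reports TRUE ...
identical(utf8_str, latin1_str)
# [1] TRUE

# ... but the underlying bytes differ, so they are distinct CHARSXPs.
charToRaw(utf8_str)    # 68 c3 b8
charToRaw(latin1_str)  # 68 f8

# Normalising both sides to one encoding makes the bytes agree.
same_bytes <- identical(charToRaw(enc2utf8(utf8_str)),
                        charToRaw(enc2utf8(latin1_str)))
same_bytes
# [1] TRUE
```

Any lookup that compares cached string pointers rather than translated contents would need this kind of normalisation on both sides first.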

The Encoding help page has a lot of relevant information. I think this passage is the most pertinent:

There are other ways for character strings to acquire a declared encoding apart from explicitly setting it (and these have changed as R has evolved). Functions scan, read.table, readLines, and parse have an encoding argument that is used to declare encodings, iconv declares encodings from its from argument, and console input in suitable locales is also declared. intToUtf8 declares its output as "UTF-8", and output text connections (see textConnection) are marked if running in a suitable locale. Under some circumstances (see its help page) source(encoding=) will mark encodings of character strings it outputs.
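
Two of the routes mentioned in that passage can be tried out directly. The sketch below is only an illustration with made-up variable names: it builds "hø" once with intToUtf8 and once from raw bytes followed by an explicit declaration.

```r
# Route 1: intToUtf8() declares its (non-ASCII) output as UTF-8.
from_codepoints <- intToUtf8(c(104L, 248L))   # code points for "h" and "ø"
Encoding(from_codepoints)
# [1] "UTF-8"

# Route 2: explicit declaration. rawToChar() yields an unmarked string;
# the bytes 68 f8 are "hø" in latin1, so we declare them as such.
from_bytes  <- rawToChar(as.raw(c(0x68, 0xf8)))
mark_before <- Encoding(from_bytes)           # "unknown"
Encoding(from_bytes) <- "latin1"
mark_after  <- Encoding(from_bytes)           # "latin1"

# Same text, different declared encodings, still identical():
identical(from_codepoints, from_bytes)
# [1] TRUE
```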

Update

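It seems to me that any string made up only of basic ASCII characters (character codes 0-127) gets an "unknown" encoding, while a string containing any character outside that range, including the extended ASCII codes (character codes 128-255), is marked "UTF-8" by default. That is what I see on my system; the question's output shows [latin1] instead, so the mark evidently depends on the locale.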
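
The ASCII half of this observation is easy to check: a pure-ASCII string keeps the "unknown" mark even if you try to declare an encoding for it. A small sketch follows; is_ascii is a hypothetical helper written for this example and not a base R function, and the non-ASCII string is written with a \u escape so that the result does not depend on how the console or source file is encoded.

```r
# Hypothetical helper: TRUE if every byte of a string is below 128.
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

ascii     <- "a"
non_ascii <- "h\u00f8"    # "hø"

is_ascii(c(ascii, non_ascii))
# [1]  TRUE FALSE

# Declaring an encoding on a pure-ASCII string has no effect.
Encoding(ascii) <- "UTF-8"
Encoding(ascii)
# [1] "unknown"

# The non-ASCII string carries a mark.
Encoding(non_ascii)
# [1] "UTF-8"
```

This fits the question's output, where "a" resolves to a single cached entry while "hø" can end up cached once per encoding.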

Simon O'Hanlon answered Nov 16 '22 18:11