This question stems from data.table bug report #4978, but I'm going to use a data.frame example to illustrate that this is not a data.table-specific issue.
Consider the following:
df = data.frame(a = 1, hø = 1)
identical(names(df), c("a", "hø"))
#[1] TRUE
.Internal(inspect(names(df)))
#@0x0000000007b27458 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
# @0x000000000ee604c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
# @0x0000000007cfa910 09 CHARSXP g0c1 [gp=0x21] [cached] "hø"
.Internal(inspect(c("a", "hø")))
#@0x0000000007b274c8 16 STRSXP g0c2 [] (len=2, tl=0)
# @0x000000000ee604c0 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
# @0x0000000007cfa970 09 CHARSXP g0c1 [gp=0x24,ATT] [latin1] [cached] "hø"
Notice that even though identical() considers the two vectors identical, the string cache holds "hø" at two different addresses (with different encoding flags), while "a" is stored only once. What is happening? Is this an R string-caching bug?
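To make the discrepancy reproducible regardless of console or locale, here is a minimal sketch that builds "hø" from explicit bytes in two encodings. The variable names `x_utf8` and `x_latin1` are made up for illustration, and the UTF-8/latin1 pair is an assumption standing in for the native/latin1 pair seen in the inspect output above.

```r
# Build "hø" twice from raw bytes so the result does not depend on how
# the console or source file is encoded.
x_utf8 <- rawToChar(as.raw(c(0x68, 0xc3, 0xb8)))  # "h" + UTF-8 bytes of "ø"
Encoding(x_utf8) <- "UTF-8"

x_latin1 <- rawToChar(as.raw(c(0x68, 0xf8)))      # "h" + latin1 byte of "ø"
Encoding(x_latin1) <- "latin1"

# identical() compares strings after translating them to a common encoding
identical(x_utf8, x_latin1)
#[1] TRUE

# ...but the declared encodings and the underlying bytes differ
Encoding(c(x_utf8, x_latin1))
#[1] "UTF-8"  "latin1"
charToRaw(x_utf8)
#[1] 68 c3 b8
charToRaw(x_latin1)
#[1] 68 f8
```

Because the bytes and the encoding mark differ, the two strings cannot share a single cache entry, which would explain the two addresses shown by inspect.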
This matters because %chin%
fails here as a result of that discrepancy:
library(data.table)
"a" %chin% names(df)
#[1] TRUE
"hø" %chin% names(df)
#[1] FALSE
"hø"
is marked as UTF-8 when entered directly at the console. You can force it to the native encoding with enc2native,
and the problem disappears; however, I am still working out why this is:
Encoding("hø")
# [1] "UTF-8"
.Internal( inspect( c( "a" , enc2native("hø") ) ) )
#@1081d60a0 16 STRSXP g0c2 [] (len=2, tl=0)
# @100af87d8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
# @1081e3a08 09 CHARSXP g1c1 [MARK,gp=0x21] [cached] "hø"
enc2native("hø") %chin% names(df)
#[1] TRUE
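The same normalization idea can be sketched without relying on the session's locale by converting both sides to UTF-8 with enc2utf8 instead of enc2native. This is only an illustration under that assumption: the variable names are hypothetical, and whether it cures the %chin% case depends on how that lookup compares strings.

```r
# Two representations of "hø", built from raw bytes (illustrative names)
needle_latin1 <- rawToChar(as.raw(c(0x68, 0xf8)))
Encoding(needle_latin1) <- "latin1"
target_utf8 <- rawToChar(as.raw(c(0x68, 0xc3, 0xb8)))
Encoding(target_utf8) <- "UTF-8"

# Convert both to UTF-8 before comparing
needle_norm <- enc2utf8(needle_latin1)
target_norm <- enc2utf8(target_utf8)

Encoding(needle_norm)
#[1] "UTF-8"

# After normalization the two strings are byte-for-byte the same
identical(charToRaw(needle_norm), charToRaw(target_norm))
#[1] TRUE
```

Normalizing to UTF-8 rather than to the native encoding has the advantage that the result does not change between machines with different locales.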
The Encoding
help page has a lot of relevant information; I think this passage is relevant:
There are other ways for character strings to acquire a declared encoding apart from explicitly setting it (and these have changed as R has evolved). Functions scan, read.table, readLines, and parse have an encoding argument that is used to declare encodings, iconv declares encodings from its from argument, and console input in suitable locales is also declared. intToUtf8 declares its output as "UTF-8", and output text connections (see textConnection) are marked if running in a suitable locale. Under some circumstances (see its help page) source(encoding=) will mark encodings of character strings it outputs.
It seems to me that any string made up only of basic ASCII characters (character codes 0-127) gets an "unknown"
encoding, and any string containing characters outside this range gets a declared encoding ("UTF-8"
in my session, latin1 in the question's output), including characters from the extended ASCII range (character codes 128-255).
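A short check of this observation, as a sketch: it uses intToUtf8 (which the help page quoted above says declares its output as "UTF-8") so that the result does not depend on console input. The variable names are illustrative only.

```r
# Pure ASCII strings report "unknown" ...
ascii <- "a"
Encoding(ascii)
#[1] "unknown"

# ... and stay that way even if we try to mark them
Encoding(ascii) <- "UTF-8"
Encoding(ascii)
#[1] "unknown"

# A string containing a character outside 0-127 carries a declared encoding.
# Code points 104 and 248 are "h" and "ø".
non_ascii <- intToUtf8(c(104L, 248L))
Encoding(non_ascii)
#[1] "UTF-8"
```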