There is a strange behavior of stringr that is really annoying me: stringr silently changes the encoding of some strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements with such letters are converted to a new encoding.
letter1 <- readline('Gimme an ASCII character!') # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters) # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'
This is a problem because I use data.table for (fast) merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.
Any work-around?
EDIT: I thought I could fall back to the base functions, but they do not preserve the encoding either. paste preserves it, but sub, for instance, does not.
Encoding(paste(' ', Letters)) # 'unknown'
Encoding(str_c(' ', Letters)) # mixed
Encoding(sub('^ +', '', paste(' ', Letters))) # mixed
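One work-around (a base-R sketch, not from the original post) is to re-mark everything as UTF-8 after the string operations, so the vector ends up with a single declared encoding for its non-ASCII elements:

```r
# enc2utf8() converts latin1-marked strings to UTF-8 and leaves
# ASCII and UTF-8 strings alone, so it is safe to apply repeatedly.
x <- "caf\xe9"            # \xe9 is 'é' in latin1
Encoding(x) <- "latin1"
y <- enc2utf8(x)
Encoding(y)               # "UTF-8"
x == y                    # TRUE: R translates encodings before comparing
```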
Character strings in R can be declared to be encoded in "latin1" or "UTF-8", or as "bytes". These declarations can be read by Encoding, which returns a character vector of values "latin1", "UTF-8", "bytes" or "unknown", or set, in which case value is recycled as needed and other values are silently treated as "unknown".
stringr is changing the encoding because stringr is a wrapper around the stringi package, and stringi always encodes in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.
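For example (assuming stringr is installed), even an untrimmed latin1-declared string comes back from str_trim marked as UTF-8:

```r
library(stringr)

x <- " caf\xe9 "          # \xe9 is 'é' in latin1
Encoding(x) <- "latin1"
y <- str_trim(x)
Encoding(y)               # "UTF-8": stringi re-encoded the latin1 input
```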
To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
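A sketch of that normalization step, using only base R (the helper name normalize_key is hypothetical):

```r
# enc2utf8() re-marks latin1-declared elements as UTF-8 and leaves
# pure-ASCII elements alone, giving the join columns a consistent encoding.
normalize_key <- function(x) enc2utf8(as.character(x))

a <- c("q", "caf\xe9")                # second element is latin1 bytes
Encoding(a) <- c("unknown", "latin1")
b <- normalize_key(a)
Encoding(b)   # "unknown" "UTF-8" -- the non-ASCII element is now UTF-8
```

With data.table this would be applied to the join columns before merging, e.g. dt[, id := normalize_key(id)] (column name id is illustrative).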
With a recent commit, data.table now takes care of these mixed encodings implicitly, by ensuring proper encodings while creating data.tables as well as in functions like unique() and duplicated().
See news item (23) under bugs for v1.9.7 in README.md.
Please test and write back if you face any further issues.