
Why is stringr changing encoding when manipulating strings?

stringr has a strange behavior that is really annoying me: without any warning, it changes the encoding of strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements containing such letters come back with a different Encoding.

library(stringr)
letter1 <- readline('Gimme an ASCII character!')    # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters)           # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'

This is a problem because I use data.table for (fast) merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.

Any workaround?

EDIT: I thought I could fall back on the base functions, but they don't preserve the encoding either. paste preserves it, but sub, for instance, does not.

 Encoding(paste(' ', Letters))                 # 'unknown'
 Encoding(str_c(' ', Letters))                 # mixed
 Encoding(sub('^ +', '', paste(' ', Letters))) # mixed
Arthur asked Nov 02 '15 16:11




2 Answers

stringr is changing the encoding because stringr is a wrapper around the stringi package, and stringi always encodes in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.

To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
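As a rough sketch of that conversion, the example below builds a hypothetical key vector `keys` with one ASCII value and one latin1-encoded ø, then converts it to UTF-8. `enc2utf8` is a base R alternative not mentioned above, and the `iconv` call assumes the source encoding really is latin1. Note that the result still reports a mix of 'unknown' and 'UTF-8': R never marks pure-ASCII strings with an encoding, and their bytes are identical in UTF-8 anyway.

```r
# Hypothetical key vector: one ASCII value and one latin1-encoded "ø" (byte 0xF8).
keys <- c("q", "\xf8")
Encoding(keys) <- "latin1"   # only the non-ASCII element gets marked

# Option 1 (base R): convert each element according to its declared encoding.
keys_utf8 <- enc2utf8(keys)

# Option 2 (base R): iconv, assuming the source encoding is known to be latin1.
keys_iconv <- iconv(keys, from = "latin1", to = "UTF-8")

# Option 3: stringi, only if the package is installed.
if (requireNamespace("stringi", quietly = TRUE)) {
  keys_stri <- stringi::stri_enc_toutf8(keys)
}

Encoding(keys)       # before conversion
Encoding(keys_utf8)  # after conversion; ASCII elements stay 'unknown'
```

Applying the same conversion to the id column(s) of every table before the merge should leave all non-ASCII keys in one encoding.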

Ista answered Oct 23 '22 11:10


With this recent commit, data.table now takes care of these mixed encodings implicitly by ensuring proper encodings while creating data.tables, as well as by ensuring proper encodings in functions like unique() and duplicated().

See news item (23) under bugs for v1.9.7 in README.md.

Please test and write back if you face any further issues.

Arun answered Oct 23 '22 09:10