There is a strange behavior of stringr that is really annoying me: stringr silently changes the encoding of some strings that contain non-ASCII characters, in my case ø, å, æ, é and some others. If you str_trim a character vector, the elements with such letters are converted to a new encoding.
letter1 <- readline('Gimme an ASCII character!') # try q or a
letter2 <- readline('Gimme a non-ASCII character!') # try ø or é
Letters <- c(letter1, letter2)
Encoding(Letters) # 'unknown'
Encoding(str_trim(Letters)) # mixed 'unknown' and 'UTF-8'
This is a problem because I use data.table for (fast) merges of big tables, data.table does not support mixed encodings, and I could not find a way to get back to a uniform encoding.
Any work-around?
EDIT: I thought I could fall back to the base functions, but they do not preserve the encoding either. paste preserves it, but sub, for instance, does not.
Encoding(paste(' ', Letters)) # 'unknown'
Encoding(str_c(' ', Letters)) # mixed
Encoding(sub('^ +', '', paste(' ', Letters))) # mixed
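One work-around (a base-R sketch, not from the original post) is to re-mark everything as UTF-8 after the string operations, so the vector ends up with a single declared encoding for its non-ASCII elements:

```r
# enc2utf8() converts latin1-marked strings to UTF-8 and leaves
# ASCII and UTF-8 strings alone, so it is safe to apply repeatedly.
x <- "caf\xe9"            # \xe9 is 'é' in latin1
Encoding(x) <- "latin1"
y <- enc2utf8(x)
Encoding(y)               # "UTF-8"
x == y                    # TRUE: R translates encodings before comparing
```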
Character strings in R can be declared to be encoded in "latin1" or "UTF-8", or as "bytes". These declarations can be read by Encoding, which returns a character vector of values "latin1", "UTF-8", "bytes" or "unknown", or set, in which case value is recycled as needed and other values are silently treated as "unknown".
stringr is changing the encoding because stringr is a wrapper around the stringi package, and stringi always encodes in UTF-8. See help("stringi-encoding", package = "stringi") for details and an explanation of this design choice.
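For example (assuming stringr is installed), even an untrimmed latin1-declared string comes back from str_trim marked as UTF-8:

```r
library(stringr)

x <- " caf\xe9 "          # \xe9 is 'é' in latin1
Encoding(x) <- "latin1"
y <- str_trim(x)
Encoding(y)               # "UTF-8": stringi re-encoded the latin1 input
```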
To avoid problems with merging data.tables, just make sure all the id variable(s) are encoded in UTF-8. You can do that using stri_enc_toutf8 in the stringi package, or using iconv.
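A sketch of that normalization step, using only base R (the helper name normalize_key is hypothetical):

```r
# enc2utf8() re-marks latin1-declared elements as UTF-8 and leaves
# pure-ASCII elements alone, giving the join columns a consistent encoding.
normalize_key <- function(x) enc2utf8(as.character(x))

a <- c("q", "caf\xe9")                # second element is latin1 bytes
Encoding(a) <- c("unknown", "latin1")
b <- normalize_key(a)
Encoding(b)   # "unknown" "UTF-8" -- the non-ASCII element is now UTF-8
```

With data.table this would be applied to the join columns before merging, e.g. dt[, id := normalize_key(id)] (column name id is illustrative).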
With a recent commit, data.table now takes care of these mixed encodings implicitly, by ensuring proper encodings while creating data.tables as well as in functions like unique() and duplicated().
See news item (23) under bugs for v1.9.7 in README.md.
Please test and write back if you face any further issues.