 

Force character vector encoding from "unknown" to "UTF-8" in R

I have a problem with inconsistent encoding of a character vector in R.

The text file which I read a table from is encoded in UTF-8 (checked in Notepad++; I also tried UTF-8 without BOM).

I want to read a table from this text file, convert it to a data.table, set a key and make use of binary search. When I tried to do so, the following appeared:

Warning message:
In `[.data.table`(poli.dt, "żżonymi", mult = "first") :
  A known encoding (latin1 or UTF-8) was detected in a join column. data.table
  compares the bytes currently, so doesn't support mixed encodings well; i.e.,
  using both latin1 and UTF-8, or if any unknown encodings are non-ascii and
  some of those are marked known and others not. But if either latin1 or UTF-8
  is used exclusively, and all unknown encodings are ascii, then the result
  should be ok. In future we will check for you and avoid this warning if
  everything is ok. The tricky part is doing this without impacting
  performance for ascii-only cases.

and binary search does not work.
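
For context, here is roughly what I am doing (a sketch with a placeholder file name, not my exact code; the key column is word):

library(data.table)

poli.df <- read.table("dictionary.txt", header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE, encoding = "UTF-8")
poli.dt <- as.data.table(poli.df)
setkey(poli.dt, word)                # key on the character column
poli.dt["żżonymi", mult = "first"]   # binary search -> warning above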

I realised that my data.table key column contains both "unknown" and "UTF-8" Encoding types:

> table(Encoding(poli.dt$word))
unknown   UTF-8 
2061312 2739122 

I tried to convert this column (before creating the data.table object) using:

  • Encoding(word) <- "UTF-8"
  • word<- enc2utf8(word)

but with no effect (a sketch of both attempts follows below).
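
Applied to the column, the attempts looked like this (a sketch; word is the key column before building the data.table):

# attempt 1: overwrite the declared encoding mark
Encoding(word) <- "UTF-8"
table(Encoding(word))   # still a mix of "unknown" and "UTF-8"

# attempt 2: convert and re-mark via enc2utf8
word <- enc2utf8(word)
table(Encoding(word))   # same result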

I also tried a few different ways of reading the file into R (setting all the helpful parameters, e.g. encoding = "UTF-8"):

  • data.table::fread
  • utils::read.table
  • base::scan
  • colbycol::cbc.read.table

but again with no effect (see the sketch after this list).
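
A sketch of those attempts (exact argument names differ between functions, and my fread in data.table 1.9.2 has no encoding argument):

# utils::read.table, declaring the input encoding
poli.df <- read.table("dictionary.txt", header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE,
                      fileEncoding = "UTF-8", encoding = "UTF-8")

# base::scan in character mode
words <- scan("dictionary.txt", what = character(), sep = "\n",
              encoding = "UTF-8")

# data.table::fread
poli.dt <- data.table::fread("dictionary.txt")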

==================================================

My R.version:

> R.version
               _                           
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy

My session info:

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250    LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   
Asked May 16 '14 by Marta Karas


1 Answer

The Encoding function returns unknown if a character string has a "native encoding" mark (CP-1250 in your case) or if it's in ASCII. To discriminate between these two cases, call:

library(stringi)
stri_enc_mark(poli.dt$word)

To check whether each string is a valid UTF-8 byte sequence, call:

all(stri_enc_isutf8(poli.dt$word)) 

If it's not the case, your file is definitely not in UTF-8.
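
In such a case you may let ICU guess the most likely encoding from a sample of the data (a heuristic sketch, not a guarantee):

txt <- paste(head(poli.dt$word, 1000), collapse = "\n")
stri_enc_detect(txt)   # candidate encodings with confidence scores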

I suspect that you haven't forced UTF-8 mode in the data-reading function (try inspecting the contents of poli.dt$word to verify this). If my guess is correct, try:

read.csv2(file("filename", encoding="UTF-8")) 

or

poli.dt$word <- stri_encode(poli.dt$word, "", "UTF-8") # re-mark encodings 

If data.table still complains about the "mixed" encodings, you may want to transliterate the non-ASCII characters, e.g.:

stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")
## [1] "Zazolc gesla jazn"
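
Putting it all together, a sketch of a complete fix, assuming the file really is UTF-8 (file name is a placeholder; column name as in your question):

library(data.table)
library(stringi)

poli.df <- read.csv2(file("dictionary.txt", encoding = "UTF-8"),
                     stringsAsFactors = FALSE)
poli.dt <- as.data.table(poli.df)

# normalise the declared-encoding marks on the key column;
# pure-ASCII strings stay "unknown", which data.table handles fine
poli.dt[, word := stri_encode(word, "", "UTF-8")]

setkey(poli.dt, word)
poli.dt["żżonymi", mult = "first"]   # binary search, no mixed-encoding warning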
Answered Oct 05 '22 by gagolews