
Why doesn't R data.table support non-ASCII keys well on Windows?

I've filed the issue on GitHub but got no response. data.table is a great R package that helps us a lot in daily work.

However, after version 1.9.6 it suddenly stopped supporting non-ASCII keys on Windows when the column is not encoded in UTF-8 (R's default encoding for non-ASCII characters depends on the platform).

It's very probably a bug (and a big one, I would say). I'm surprised that no one has paid attention to it or complained, since the bug has existed for almost 2 years.

I've spent hours trying to solve the issue but failed. The related commits are https://github.com/Rdatatable/data.table/commit/03cd45f83fe41e4a6507b9b2e4f955c105979c8c and https://github.com/Rdatatable/data.table/commit/409d709380e865d014f21f17a254e0bbcf1e156d

These commits try to convert strings in other encodings to UTF-8, and then sort and compare all strings in UTF-8. The encoding handling looks correct, but I suspect the bug is hidden there. The implementation of data.table is really complex, so I'm asking whether anyone can help so that we can make a PR to settle this.
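As a small base R illustration of the "convert to UTF-8, then compare" idea (this is not data.table's C code, only a sketch), the snippet below takes one character in two encodings. It uses latin1 rather than the Chinese locale from my session because R marks latin1 strings explicitly, so the example should behave the same on any platform. The variable names are made up for illustration.

```r
# One character (e with acute accent) stored as UTF-8 ...
x_utf8 <- enc2utf8("\u00e9")
# ... and the same character re-encoded and marked as latin1.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

print(Encoding(c(x_utf8, x_latin1)))

# The raw bytes differ, so a byte-wise comparison sees two different strings.
same_bytes_before <- identical(charToRaw(x_utf8), charToRaw(x_latin1))

# After converting the latin1 string to UTF-8, the bytes agree.
same_bytes_after <- identical(charToRaw(x_utf8), charToRaw(enc2utf8(x_latin1)))

print(c(before = same_bytes_before, after = same_bytes_after))
```

If any code path sorts or compares the bytes without doing this conversion first, the result can depend on the encoding of the input.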

Thanks very much.

Minimal reproducible example

Dataset

library(data.table)
## data.table 1.10.5 IN DEVELOPMENT built 2017-12-01 20:06:10 UTC
## The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
##  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
##  Release notes, videos and slides: http://r-datatable.com
dt <- data.table(
  x = c("公允价值变动损益", "红利收入", "价差收入", "其他业务支出", "资产减值损失"),
  y = 1:5,
  key = "x"
)

Fails (returns NA) if the encoding is native

dt[]
##                   x y
## 1: 公允价值变动损益 1
## 2:         红利收入 2
## 3:         价差收入 3
## 4:     其他业务支出 4
## 5:     资产减值损失 5
Encoding(dt$x) 
## [1] "unknown" "unknown" "unknown" "unknown" "unknown"
dt[J("公允价值变动损益")][]
##                   x  y
## 1: 公允价值变动损益 NA

Succeeds only if the encoding is converted to UTF-8

Now it returns the correct answer, 1. Note that the order of dt also changes, which is not supposed to happen.

dt[, x := enc2utf8(x)]
setkey(dt, x)

dt[]
##                   x y
## 1:         价差收入 3
## 2: 公允价值变动损益 1
## 3:     其他业务支出 4
## 4:         红利收入 2
## 5:     资产减值损失 5
Encoding(dt$x)
## [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
dt[J("公允价值变动损益")][]
##                   x y
## 1: 公允价值变动损益 1

sessionInfo

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.5
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.1     htmltools_0.3.6 Rcpp_0.12.13    stringi_1.1.5  
##  [9] rmarkdown_1.8   knitr_1.17      stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10.1
Shrek Tan asked Dec 01 '17 19:12


1 Answer

I'm answering my own question to close it, since this issue has been solved in a PR.

For strings, data.table compares values in UTF-8 encoding. However, because two ENC2UTF8 calls were missing in csort() and csort_pre(), the order that data.table created actually depended on the encoding. On Windows, where the default encoding is not UTF-8, this led to some weird output when keys contained non-ASCII strings.
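To illustrate why the order depends on the encoding, here is a minimal base R sketch (not data.table's internal code). It takes two of the keys from the question and compares their first byte in UTF-8 and in GBK. It assumes that the platform's iconv knows a conversion named "GBK"; the labels hongli and jiacha are only there for illustration.

```r
# Two keys from the question, written with \u escapes so the script
# does not depend on the encoding of the source file.
keys_utf8 <- enc2utf8(c("\u7ea2\u5229\u6536\u5165",   # "hong li shou ru"
                        "\u4ef7\u5dee\u6536\u5165"))  # "jia cha shou ru"
names(keys_utf8) <- c("hongli", "jiacha")

# Assumption: this platform's iconv supports the name "GBK".
keys_gbk <- iconv(keys_utf8, from = "UTF-8", to = "GBK")
names(keys_gbk) <- names(keys_utf8)

# Value of the first byte of a string, as an integer.
first_byte <- function(s) as.integer(charToRaw(s))[1]

utf8_first <- vapply(keys_utf8, first_byte, integer(1))
gbk_first  <- vapply(keys_gbk,  first_byte, integer(1))

# Order of the two keys by first byte, under each encoding.
utf8_order <- names(sort(utf8_first))
gbk_order  <- names(sort(gbk_first))

print(utf8_order)
print(gbk_order)
```

The two orders come out reversed, which matches the two printouts of dt in the question. A lookup that assumes one order while the data was sorted in the other can miss a key that is actually present, which is consistent with the NA result.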

To debug this case, you need to know how to print non-ASCII characters from a C routine to R's output. If you pass the string to Rprintf() directly you get a mess; you have to call translateChar() on the string first.

References:

  • http://r.789695.n4.nabble.com/Rprintf-expected-encoding-td4740717.html
  • http://r.789695.n4.nabble.com/How-to-print-UTF-8-encoded-strings-from-a-C-routine-to-R-s-output-td4724337.html
Shrek Tan answered Oct 22 '22 05:10