
Script with utf-8 text runs differently from RStudio and command line in Windows

I'm parsing files that contain Hindi text. I wrote my code in RStudio and it ran there without many issues. Now I need to run the same script from the command line using R.exe/Rscript.exe, and it doesn't behave the same way. I ran this simple script from both RStudio and the terminal:

n_p<-'नाम'

Encoding(n_p)

gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()

Output in RStudio:

> n_p<-'नाम'
> 
> Encoding(n_p)
[1] "UTF-8"
> 
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1

[[2]]
[1] 1
attr(,"match.length")
[1] 3

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
[5] LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rJava_0.9-10

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0   

Output with R.exe in cmd (used for debugging; Rscript.exe gives similar, if not identical, output):

> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1

[[2]]
[1] 1
attr(,"match.length")
[1] 9

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.0

I've tried changing locales, but Sys.setlocale refuses to work properly. In some cases, gregexpr throws an error when it can't parse non-ASCII text. And when it does run without errors, it doesn't match regular expressions properly. I can't provide a reproducible example at the moment, but I will try to later.

Help.

Rohit asked Jun 08 '18 13:06



1 Answer

The right answer is to run Rscript with the option --encoding=<file encoding>, where <file encoding> is the encoding your script file is saved in.

There is no need to set the locale and, as you probably found out, that doesn't work anyway. If your file is UTF-8, run: Rscript.exe --encoding=UTF-8 file.R
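
If you cannot control how the script is launched, one possible complement is to keep non-ASCII literals out of the source file altogether. The following is only a minimal sketch, assuming the pattern from the question (नाम, written here as the code points U+0928, U+093E, U+092E); it is not part of the --encoding approach itself, and whether it fits depends on how much Hindi text your script embeds directly.

```r
# Sketch under an assumption: the pattern is written with \u escapes, so the
# literal does not depend on how R.exe/Rscript.exe decodes the script file.
n_p <- "\u0928\u093e\u092e"                 # the word used in the question

# Same test vector as in the question, rebuilt from the escaped pattern.
x <- c("adfdafc", paste(n_p, "adsfdfa"))

print(Encoding(n_p))                        # shows how R has marked the string
m <- gregexpr(n_p, x)                       # same call as in the question
print(m)
```

Running this from both RStudio and Rscript.exe and comparing the printed results is a quick way to check whether the remaining differences come from the script file's encoding or from the data files you read.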

Leonardo Motta answered Oct 05 '22 22:10