I'm working with files containing text in Hindi and parsing them. I wrote my code in Rstudio and executed it without many issues. But now, I need to execute the same script from command line using R.exe/Rscript.exe and it doesn't work the same way. I've run a simple script from both RStudio and the terminal:
n_p<-'नाम'
Encoding(n_p)
gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()
Output In RStudio:
> n_p<-'नाम'
>
> Encoding(n_p)
[1] "UTF-8"
>
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 3
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rJava_0.9-10
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
Output with R.exe in cmd (For debugging purposes. Rscript.exe gives a similar if not identical output)
> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 9
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
I've tried changing locales but Sys.setlocale
refuses to work properly. In some cases, gregexpr
gives an error when it can't parse non ASCII code. And finally, when it does run without errors, it doesn't match regular expressions properly. I can't provide a reproducible example at the moment, but I will try to later.
Help.
You can run R from the command line.
On Windows, the native encoding cannot be UTF-8 nor any other that could represent all Unicode characters. Windows sometimes replaces characters by similarly looking representable ones (“best-fit”), which often works well but sometimes has surprising results, e.g. alpha character becomes letter a.
You can view or change this default in the Tools : Options (for Windows & Linux) or Preferences (for Mac) dialog, in the General section. If you don't set a default encoding, files will be opened using UTF-8 (on Mac desktop, Linux desktop, and server) or the system's default encoding (on Windows).
The right answer is that you should run Rscript with the option --encoding=file encoding
There is no need to set locale, and as you probably found out, it doesn't work anyway. If your file is UTF-8: Rscript.exe --encoding=UTF-8 file.R
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With