UTF-8 file output in R

Tags: r, unicode, cjk

I'm using R 2.15.0 on Windows 7 64-bit. I would like to output Unicode (CJK) text to a file.

The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected:

rty <- file("test.txt",encoding="UTF-8")
write("在", file=rty)
close(rty)
rty <- file("test.txt",encoding="UTF-8")
scan(rty,what=character())
close(rty)

As shown by the output of scan:

Read 1 item 
[1] "<U+5728>"

The file was not written with the UTF-8 character itself, but with some kind of ANSI-compliant fallback. Can I make it work right the first time (i.e. produce a text file that has "在" in it instead), or can I work some extra magic to convert the output to Unicode, with the proper character replacing the code string?

Thanks.

[More info: the same code behaves properly in Cygwin, R 2.14.2, while 2.14.2 on Win7 is also broken. Is this on my end somewhere?]

Patrick asked May 20 '12 16:05


2 Answers

The problem is due to Windows-specific behaviour in R: when writing text, R appears to convert strings to the system's default (native) encoding, and characters that encoding cannot represent come out as escapes such as <U+5728>. I do not know all the specifics, but the behaviour is known.

To write UTF-8 encoded text on Windows, pass useBytes=TRUE to writeLines so that the string's bytes are written as they are, and tell readLines the encoding when reading the file back:

txt <- "在"
writeLines(txt, "test.txt", useBytes=TRUE)

readLines("test.txt", encoding="UTF-8")
[1] "在"
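
One way to check that the file really contains UTF-8, rather than an escape like <U+5728>, is to look at its raw bytes. The following is only a minimal sketch under a few assumptions that are not part of the answer above: it writes to a temporary file, uses the escape "\u5728" for 在 so the script itself stays ASCII, and calls enc2utf8 to make sure the string is UTF-8 before it is written.

```r
# "\u5728" is the character from the question; enc2utf8() ensures the
# string is UTF-8 encoded before its bytes are written out.
txt <- enc2utf8("\u5728")

# Hypothetical path for illustration; any writable file name would do.
path <- tempfile(fileext = ".txt")

# useBytes = TRUE suppresses re-encoding to the native encoding on output.
writeLines(txt, path, useBytes = TRUE)

# Read the file back as raw bytes. The UTF-8 encoding of U+5728 is
# e5 9c a8, followed by whatever line ending the platform writes.
raw_bytes <- readBin(path, what = "raw", n = file.info(path)$size)
print(raw_bytes)

# Read it back as text, declaring the encoding explicitly.
back <- readLines(path, encoding = "UTF-8")
print(back == txt)
```

If the first three bytes are not e5 9c a8, the string was most likely re-encoded somewhere before it reached the file.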

For much more detail, see this really well written article by Kevin Ushey: http://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/

petermeissner answered Sep 22 '22 03:09


This saves UTF-8 strings to a text file:

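kLogFileName <- "parser.log"
# Note: defining log() here masks base::log() in this environment.
log <- function(msg="") {
  con <- file(kLogFileName, "a")  # open the log file in append mode
  tryCatch({
    # convert msg to UTF-8 before writing it to the connection
    cat(iconv(msg, to="UTF-8"), file=con, sep="\n")
  },
  finally = {
    close(con)  # always release the connection, even on error
  })
}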
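
As a possible variant (a sketch only, not the answerer's code), the append-mode logger can be combined with the useBytes=TRUE approach from the other answer. The name log_utf8, its path argument, and the temporary demo file are assumptions made for illustration.

```r
# Hypothetical helper: append one or more messages to a log file as UTF-8.
log_utf8 <- function(msg = "", path = "parser.log") {
  con <- file(path, open = "a")      # append mode, as in the answer above
  on.exit(close(con), add = TRUE)    # close the connection even on error
  # enc2utf8() converts to UTF-8; useBytes = TRUE writes those bytes
  # without re-encoding them to the native encoding.
  writeLines(enc2utf8(msg), con, useBytes = TRUE)
  invisible(NULL)
}

# Usage: append two entries to a temporary log and read them back.
demo_log <- tempfile(fileext = ".log")
log_utf8("first entry: \u5728", demo_log)
log_utf8("second entry", demo_log)
entries <- readLines(demo_log, encoding = "UTF-8")
print(entries)
```

Using a distinct function name also avoids masking base::log.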
beloblotskiy answered Sep 18 '22 03:09