
How to write Unicode string to text file in R Windows?

I've figured out how to write Unicode strings, but I'm still puzzled by why it works.

str <- "ỏ"
Encoding(str) # UTF-8
cat(str, file="no-iconv") # Written wrongly as <U+1ECF>
cat(iconv(str, to="UTF-8"), file="yes-iconv") # Written correctly as ỏ

I understand why the no-iconv approach does not work: cat (and writeLines as well) converts the string to the native encoding first and only then to the target (to=) encoding. On Windows, this means R converts to Windows-1252 first, which cannot represent ỏ, resulting in <U+1ECF>.

What I don't understand is why the yes-iconv approach works. If I understand correctly, what iconv does here is simply to return a string with the UTF-8 encoding. But str is already in UTF-8! Why should iconv make any difference? In addition, when iconv(str, to="UTF-8") is passed to cat, shouldn't cat mess everything up once again by first converting to Windows-1252?

Heisenberg asked Jul 07 '16 04:07





1 Answer

I think setting the Encoding of (a copy of) str to "unknown" before calling cat() is less magical and works just as well, because it should avoid any unwanted character set conversions in cat().
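In the same spirit of avoiding conversions, here is a minimal sketch (not part of the original example) that sidesteps re-encoding altogether by writing the bytes directly. It assumes the string already holds the desired UTF-8 bytes; the variable names and temporary files are only illustrative.

```r
# UTF-8 bytes of "ỏ" (U+1ECF), marked as UTF-8 as in the example
s <- "\xe1\xbb\x8f"
Encoding(s) <- "UTF-8"

# Option A: write the raw bytes; no character conversion can happen
path_a <- tempfile()
writeBin(charToRaw(s), path_a)

# Option B: writeLines() with useBytes = TRUE on a binary-mode connection,
# so the string is written byte-for-byte without re-encoding
path_b <- tempfile()
con <- file(path_b, open = "wb")
writeLines(s, con, useBytes = TRUE)
close(con)

# Read both files back as raw bytes to see what was actually written
bytes_a <- readBin(path_a, what = "raw", n = file.size(path_a))
bytes_b <- readBin(path_b, what = "raw", n = file.size(path_b))
print(bytes_a)  # e1 bb 8f
print(bytes_b)  # e1 bb 8f 0a (trailing newline added by writeLines)
```

Reading the file back with readBin() is also a convenient way to check the output files of the other approaches, since it shows the bytes on disk regardless of how the console displays the string.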

Here is an expanded example to demonstrate what I think happens in the original example:

print_info <- function(x) {
    print(x)
    print(Encoding(x))
    str(x)
    print(charToRaw(x))
}

cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")

cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")

cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")

cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f which is "ỏ").

In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.

The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:

(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
 chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f

(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
 chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f

(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
 chr "á»"
[1] e1 bb 8f

(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
 chr "á»"
[1] e1 bb 8f

In the original example, iconv() uses the default from = "" argument, which means conversion from the current locale, effectively "latin1". Because str is actually encoded in UTF-8, the byte representation of the string is distorted in step (2), but it is then implicitly restored by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
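The distort-and-restore round trip can be reproduced at the byte level without depending on the current locale by spelling out from= explicitly. This is only a sketch of the mechanism under that assumption; it does not show what cat() does internally, and the variable names are made up for illustration.

```r
# UTF-8 bytes of "ỏ" (U+1ECF)
utf8_bytes <- as.raw(c(0xe1, 0xbb, 0x8f))
x <- rawToChar(utf8_bytes)

# Misread the three bytes as latin1 and convert them to UTF-8:
# each byte becomes its own two-byte character
distorted <- iconv(x, from = "latin1", to = "UTF-8")
print(charToRaw(distorted))  # c3 a1 c2 bb c2 8f, as in step (2)

# Converting back to latin1 undoes the distortion exactly
restored <- iconv(distorted, from = "UTF-8", to = "latin1")
print(charToRaw(restored))   # e1 bb 8f, as in step (3)

round_trip_ok <- identical(charToRaw(restored), utf8_bytes)
```

The two conversions cancel out because latin1 assigns a character to every byte value, so no information is lost when the UTF-8 bytes are misread as latin1.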

mvkorpel answered Oct 19 '22 01:10
