How to write Unicode string to text file in R Windows?

Tags: r, encoding, unicode, utf-8

I've figured out how to write Unicode strings, but I'm still puzzled by why it works.

str <- "ỏ"
Encoding(str) # UTF-8
cat(str, file="no-iconv") # Written wrongly as <U+1ECF>
cat(iconv(str, to="UTF-8"), file="yes-iconv") # Written correctly as ỏ

I understand why the no-iconv approach does not work: cat (and writeLines as well) converts the string into the native encoding first and then to the to= encoding. On Windows, this means R converts ỏ to Windows-1252 first, which cannot represent ỏ, resulting in <U+1ECF>.

What I don't understand is why the yes-iconv approach works. If I understand correctly, what iconv does here is simply to return a string with the UTF-8 encoding. But str is already in UTF-8! Why should iconv make any difference? In addition, when iconv(str, to="UTF-8") is passed to cat, shouldn't cat mess everything up once again by first converting to Windows-1252?

563

asked Jul 07 '16 04:07

Heisenberg

1 Answer

I think setting the Encoding of (a copy of) str to "unknown" before using cat() is less magic and works just as well. I think that should avoid any unwanted character set conversions in cat().
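A minimal sketch of that idea follows. The helper name write_raw_utf8 and the use of a temporary file are assumptions made for illustration; they are not from the question. The sketch drops the encoding mark on a copy of the string, writes it with cat(), and reads the file back as raw bytes so the result can be checked independently of how the console displays it.

```r
# Hypothetical helper (not part of base R): write the UTF-8 bytes of a
# string to a file without giving cat() a marked encoding to convert from.
write_raw_utf8 <- function(x, path) {
    y <- enc2utf8(x)            # make sure the underlying bytes are UTF-8
    Encoding(y) <- "unknown"    # drop the mark on the copy only
    cat(y, file = path)
    invisible(path)
}

# Build the string from its bytes so the example does not depend on the
# encoding of this script file.
utf8_bytes <- as.raw(c(0xe1, 0xbb, 0x8f))   # U+1ECF encoded as UTF-8
s <- rawToChar(utf8_bytes)
Encoding(s) <- "UTF-8"

path <- tempfile()
write_raw_utf8(s, path)

# Inspect what actually reached the file.
written <- readBin(path, what = "raw", n = file.size(path))
print(written)
print(identical(written, utf8_bytes))
```

Comparing raw bytes avoids relying on print(), which may escape or garble characters that the current locale cannot show.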

Here is an expanded example to demonstrate what I think happens in the original example:

print_info <- function(x) {
    print(x)
    print(Encoding(x))
    str(x)
    print(charToRaw(x))
}

cat("(1) Original string (UTF-8)\n")
str <- "\xe1\xbb\x8f"
Encoding(str) <- "UTF-8"
print_info(str)
cat(str, file="no-iconv")

cat("\n(2) Conversion to UTF-8, wrong input encoding (latin1)\n")
## from = "" is conversion from current locale, forcing "latin1" here
str2 <- iconv(str, from="latin1", to="UTF-8")
print_info(str2)
cat(str2, file="yes-iconv")

cat("\n(3) Converting (2) explicitly to latin1\n")
str3 <- iconv(str2, from="UTF-8", to="latin1")
print_info(str3)
cat(str3, file="latin")

cat("\n(4) Setting encoding of (1) to \"unknown\"\n")
str4 <- str
Encoding(str4) <- "unknown"
print_info(str4)
cat(str4, file="unknown")

In a "Latin-1" locale (see ?l10n_info) as used by R on Windows, output files "yes-iconv", "latin" and "unknown" should be correct (byte sequence 0xe1, 0xbb, 0x8f which is "ỏ").

In a "UTF-8" locale, files "no-iconv" and "unknown" should be correct.

The output of the example code is as follows, using R 3.3.2 64-bit Windows version running on Wine:

(1) Original string (UTF-8)
[1] "ỏ"
[1] "UTF-8"
 chr "<U+1ECF>""| __truncated__
[1] e1 bb 8f

(2) Conversion to UTF-8, wrong input encoding (latin1)
[1] "á»\u008f"
[1] "UTF-8"
 chr "á»\u008f"
[1] c3 a1 c2 bb c2 8f

(3) Converting (2) explicitly to latin1
[1] "á»"
[1] "latin1"
 chr "á»"
[1] e1 bb 8f

(4) Setting encoding of (1) to "unknown"
[1] "á»"
[1] "unknown"
 chr "á»"
[1] e1 bb 8f

In the original example, iconv() uses the default argument from = "", which means conversion from the encoding of the current locale, effectively "latin1". Because str is actually encoded in UTF-8, the byte representation of the string is distorted in step (2). The distortion is then implicitly undone by cat() when it (presumably) converts the string back to the current locale, as demonstrated by the equivalent conversion in step (3).
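The distortion and its reversal can also be checked on bytes alone, without writing any files. The sketch below assumes, as in step (2), that from = "latin1" stands in for the Windows locale; the variable names are illustrative.

```r
utf8_bytes <- as.raw(c(0xe1, 0xbb, 0x8f))   # U+1ECF encoded as UTF-8
s <- rawToChar(utf8_bytes)
Encoding(s) <- "UTF-8"

# Step (2): treat each UTF-8 byte as a latin1 character and convert to UTF-8.
distorted <- iconv(s, from = "latin1", to = "UTF-8")
print(charToRaw(distorted))

# Step (3): convert back to latin1, mapping each character to a single byte.
restored <- iconv(distorted, from = "UTF-8", to = "latin1")
print(charToRaw(restored))

# TRUE when the round trip reproduces the original UTF-8 bytes.
print(identical(charToRaw(restored), utf8_bytes))
```

Because both conversions name their source encoding explicitly, this check should not depend on the locale R is running in.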

188

answered Oct 19 '22 01:10

mvkorpel
