
R: accented characters in data frame

I'm confused about why certain characters (e.g. "Ě", "Č", and "ŝ") lose their diacritical marks in a data frame, while others (e.g. "Š" and "š") do not. My OS is Windows 10.

In the sample code below, the vector czechvec holds 11 single-character strings, all Slavic accented characters, and R displays them properly. I then create a data frame mydf with czechvec as the second column (wrapped in I() so that it is not converted to a factor). When R displays mydf, or any row of mydf, most of these characters are converted to their plain ASCII equivalents; e.g. mydf[3,] shows "E", not "Ě". Subscripting with both row and column, e.g. mydf[3,2], shows the accented character ("Ě") correctly.

Why should it matter whether R displays a whole row or a single cell? And why are some characters, such as "Š", completely unaffected? Also, when I write this data frame to a file, the accents are lost, even though I specify fileEncoding="UTF-8".

> charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
> hexvals  <- as.hexmode(charvals)
> czechvec <- unlist(strsplit(intToUtf8(charvals), ""))
> czechvec
[1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"
> 
> mydf = data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
> mydf
   dec char  hex
1  193    Á 00C1
2  269    c 010D
3  282    E 011A
4  268    C 010C
5  262    C 0106
6  263    c 0107
7  348    S 015C
8  349    s 015D
9  350    S 015E
10 352    Š 0160
11 353    š 0161
> mydf[3,2]
[1] "Ě"
> mydf[3,]
  dec char  hex
3 282    E 011A
> 
> write.table(mydf, file="myfile.txt", fileEncoding="UTF-8")
> 
> df2 <- read.table("myfile.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")
> df2[3,2]
[1] "E"

Edited to add: Per Ernest A's answer, this behaviour is not reproducible in Linux. It must be a Windows issue. (I'm using R 3.4.1 for Windows.)

Montgomery Clift asked Oct 29 '22 02:10


2 Answers

I cannot reproduce this behaviour, using R version 3.3.3 (Linux).

> data.frame(dec=charvals, char=I(czechvec), hex=I(format(hexvals, width=4, upper.case=TRUE)))
   dec char  hex
1  193    Á 00C1
2  269    č 010D
3  282    Ě 011A
4  268    Č 010C
5  262    Ć 0106
6  263    ć 0107
7  348    Ŝ 015C
8  349    ŝ 015D
9  350    Ş 015E
10 352    Š 0160
11 353    š 0161
Ernest A answered Nov 15 '22 06:11


Thanks to Ernest A's answer, which showed that the weird behaviour I observed does not occur on Linux, I searched for "R Windows UTF-8 bug". That led me to an article by Ista Zahn: Escaping from character encoding hell in R on Windows

The article confirms there is a bug in the data.frame print method on Windows, and gives some workarounds. (However, the article doesn't note the issue with write.table in Windows, for data frames with foreign-language text.)
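A quick way to convince yourself that only the display is affected, and not the stored data, is to inspect the column directly. This is a minimal sketch; the variable names are made up for illustration, and it only checks the in-memory strings, not what the print method does on Windows.

```r
# Build two of the characters from their code points (Ě = 282, Š = 352).
x  <- intToUtf8(c(282, 352), multiple = TRUE)
df <- data.frame(char = I(x))

# The strings in the data frame are still marked as UTF-8 ...
enc <- Encoding(as.character(df$char))

# ... and still carry their original code points.
codes <- vapply(as.character(df$char), utf8ToInt, integer(1), USE.NAMES = FALSE)

enc
codes
```

If the code points come back unchanged, the accents are lost somewhere in the formatting or output step rather than when the data frame is created.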

One workaround suggested by Zahn is to change the locale to suit the particular language we are working with:

Sys.setlocale(category = "LC_CTYPE", locale = "czech")
charvals <- c(193, 269, 282, 268, 262, 263, 348, 349, 350, 352, 353)
hexvals  <- format(as.hexmode(charvals), width=4, upper.case=TRUE)
df1      <- data.frame(dec=charvals, char=I(unlist(strsplit(intToUtf8(charvals), ""))), hex=I(hexvals))

print.listof(df1)

dec :
 [1] 193 269 282 268 262 263 348 349 350 352 353

char :
 [1] "Á" "č" "Ě" "Č" "Ć" "ć" "Ŝ" "ŝ" "Ş" "Š" "š"

hex :
 [1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"

df1
   dec char  hex
1  193    Á 00C1
2  269    č 010D
3  282    Ě 011A
4  268    Č 010C
5  262    Ć 0106
6  263    ć 0107
7  348    S 015C
8  349    s 015D
9  350    Ş 015E
10 352    Š 0160
11 353    š 0161

Notice that the Czech characters are now displayed correctly, but "Ŝ" and "ŝ" (Unicode U+015C and U+015D, which apparently are used in Esperanto) are not, presumably because the Czech locale's code page (Windows-1250, as far as I can tell) does not contain them. With the print.listof command, all the characters are displayed correctly. (By the way, dput(df1) lists the Esperanto characters incorrectly, as "S" and "s".)

write.table(df1, file="special characters example.txt", fileEncoding="UTF-8")
df2 <- read.table("special characters example.txt", stringsAsFactors=FALSE, fileEncoding="UTF-8")

print.listof(df2)
dec :
 [1] 193 269 282 268 262 263 348 349 350 352 353

char :
 [1] "Á" "č" "Ě" "Č" "Ć" "ć" "S" "s" "Ş" "Š" "š"

hex :
 [1] "00C1" "010D" "011A" "010C" "0106" "0107" "015C" "015D" "015E" "0160"
[11] "0161"

When I write df1 with write.table and read it back with read.table as df2, the "Ŝ" and "ŝ" characters have lost their circumflex. The problem must lie with write.table rather than read.table: when I open the file in a different application, such as OpenOffice Writer, the Czech characters are all correct, but "Ŝ" and "ŝ" have already been changed to "S" and "s".
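One approach that might sidestep the conversion is to build the lines yourself and write them as raw UTF-8 bytes. The sketch below is an assumption on my part, not something from Zahn's article: write_utf8_table() is a hypothetical helper, it does no quoting or escaping of fields, and I have not verified it on every Windows version of R.

```r
# Hypothetical helper (not part of base R): write a data frame as
# tab-separated UTF-8 text without passing through the native code page.
write_utf8_table <- function(df, path, sep = "\t") {
  # Convert each column to character and make sure it is encoded as UTF-8.
  cols <- lapply(unname(as.list(df)), function(col) enc2utf8(as.character(col)))
  header <- paste(enc2utf8(names(df)), collapse = sep)
  rows <- do.call(paste, c(cols, list(sep = sep)))
  con <- file(path, open = "wb")
  on.exit(close(con))
  # useBytes = TRUE tells writeLines() to emit the bytes as they are,
  # instead of re-encoding them for the current locale.
  writeLines(c(header, rows), con, useBytes = TRUE)
  invisible(path)
}

# Example with two of the characters that were being flattened.
chars <- intToUtf8(c(282, 348), multiple = TRUE)
df <- data.frame(dec = c(282, 348), char = I(chars))
out <- tempfile(fileext = ".txt")
write_utf8_table(df, out)

# Read the lines back, declaring them as UTF-8, and recover the code points.
back <- readLines(out, encoding = "UTF-8")
fields <- strsplit(back[-1], "\t", fixed = TRUE)
vapply(fields, function(f) utf8ToInt(f[2]), integer(1))
```

Because the file is read back with readLines() rather than read.table(), the check looks at the bytes that were actually written and does not depend on the locale's code page.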

For the time being, the best workaround for my purposes is to record each character's Unicode value in the data frame instead of the character itself, write the file with write.table, and then use the UNICHAR function in OpenOffice Calc to add the character back. But this is inconvenient.
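As a rough illustration of that workaround, the code points can be derived in R with utf8ToInt(), so the data frame that gets written contains only ASCII. The column names here are my own choice for the example.

```r
# Three of the problem characters, built from their code points.
chars <- intToUtf8(c(282, 348, 349), multiple = TRUE)

# Decimal code point of each single-character string.
codes <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Conventional "U+XXXX" labels, handy for checking against a code chart.
labels <- sprintf("U+%04X", codes)   # "U+011A" "U+015C" "U+015D"

# An ASCII-only data frame: nothing here can lose a diacritic on output.
ascii_df <- data.frame(dec = codes, label = labels, stringsAsFactors = FALSE)

# Going the other way restores the original characters.
restored <- intToUtf8(ascii_df$dec, multiple = TRUE)
```

The dec column holds the decimal values, so it can be fed to a spreadsheet function that turns a code point back into a character.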

I believe this same bug is relevant to this question: how to read data in utf-8 format in R?

Edited to add: Other similar questions I've now found on Stack Overflow:

Why do some Unicode characters display in matrices, but not data frames in R?

UTF-8 file output in R

Write UTF-8 files from R

And I found a workaround for the display issue by Peter Meissner here:

http://r.789695.n4.nabble.com/Unicode-display-problem-with-data-frames-under-Windows-tp4707639p4707667.html

It involves defining your own class unicode_df and print function print.unicode_df.
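To give an idea of the general shape of such a solution, here is my own rough sketch. It is not Meissner's code: the helper as_unicode_df() is hypothetical, and the assumption that printing a character matrix avoids the problem on Windows comes from the matrix-versus-data-frame question linked above, not from my own testing.

```r
# Hypothetical constructor: tag a data frame with an extra class.
as_unicode_df <- function(df) {
  class(df) <- c("unicode_df", class(df))
  df
}

# Print method for that class: build a character matrix column by column,
# without calling format() on the text, and print the matrix instead.
print.unicode_df <- function(x, ...) {
  m <- vapply(x, function(col) enc2utf8(as.character(col)), character(nrow(x)))
  m <- matrix(m, nrow = nrow(x), dimnames = list(row.names(x), names(x)))
  print(m, quote = FALSE, ...)
  invisible(x)
}

# Example: typing `udf` at the prompt now dispatches to print.unicode_df().
chars <- intToUtf8(c(282, 348), multiple = TRUE)
udf <- as_unicode_df(data.frame(dec = c(282, 348), char = I(chars)))
print(udf)
```

Numbers are shown via as.character() here, so the column alignment and number formatting differ from the standard data frame print.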

This still does not solve the issue I have with using write.table to write my data frame (which contains some columns with text in a variety of European languages) to a file that can be imported to a spreadsheet or any arbitrary application. But perhaps Meissner's solution can be adapted to work with write.table.

Montgomery Clift answered Nov 15 '22 07:11