Hebrew Encoding Hell in R and writing a UTF-8 table in Windows

Tags: text, r, encoding, hebrew

I'm trying to save data extracted with RSelenium from https://www.magna.isa.gov.il/Details.aspx?l=he. R prints the Hebrew characters to the console correctly, but they come out garbled when I export to TXT or CSV, or when I pass the text through other simple R functions such as data.frame(), readHTMLTable(), etc.

Here is an example:

> head(lines)
[1] "גלובל פיננס ג'י.אר. 2 בע\"מ נתונים כספיים באלפי דולר ארה\"ב"
[2] "513435404"                                                  
[3] ""                                                           
[4] ""                                                           
[5] ""                                                           
[6] "4,481"

The first line turns into escaped code points (below) when I use data.frame():

> head(as.data.frame(lines))
[1] <U+05D2><U+05DC><U+05D5><U+05D1><U+05DC> <U+05E4><U+05D9><U+05E0><U+05E0><U+05E1> <U+05D2>'<U+05D9>.<U+05D0><U+05E8>. 2 <U+05D1><U+05E2>"<U+05DE> <U+05E0><U+05EA><U+05D5><U+05E0><U+05D9><U+05DD> <U+05DB><U+05E1><U+05E4><U+05D9><U+05D9><U+05DD> <U+05D1><U+05D0><U+05DC><U+05E4><U+05D9> <U+05D3><U+05D5><U+05DC><U+05E8> <U+05D0><U+05E8><U+05D4>"<U+05D1>

The same happens when exporting to .TXT or .CSV with write.table() or write.csv():

write.csv(lines,"lines.csv",row.names=FALSE)

I tried converting the encoding to "UTF-8", as suggested in several similar questions, but the problem remains, just in a different form:

iconv(lines, to = "UTF-8")
1 ׳’׳׳•׳‘׳ ׳₪׳™׳ ׳ ׳¡ ׳’'׳™.׳׳¨. 2 ׳‘׳¢"׳ ׳ ׳×׳•׳ ׳™׳ ׳›׳¡׳₪׳™׳™׳ ׳‘׳׳׳₪׳™ ׳“׳•׳׳¨ ׳׳¨׳”"׳‘

Same for Hebrew ISO-8859-8:

iconv(lines, to = "ISO-8859-8")
    1 ×'×o×.×'×o ×₪×T× × ×! ×''×T.××¨. 2 ×'×¢"×z × ×a×.× ×T× ×>×!×₪×T×T× ×'××o×₪×T ×"×.×o×¨ ××¨×""×'

I don't understand why the console prints Hebrew characters correctly while write.table(), write.csv() and data.frame() have encoding issues.

Can anyone help me export it?

Update: this part was answered by Ken; exporting the text with writeLines() worked well:

# write the UTF-8 strings as raw bytes; no separate connection object is needed
writeLines(lines, "lines.txt", useBytes = TRUE)

Yet the main issue R has with Hebrew encoding is with tables, i.e. as.data.frame(), write.table() and write.csv(). Any thoughts?

Some machine info:

Sys.info()
                 sysname                      release                      version 
               "Windows"                      "7 x64" "build 7601, Service Pack 1" 
                nodename                      machine                        login 
              "TALIS-TP"                        "x86"

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

asked Apr 26 '16 12:04

Daniel Rabetti

1 Answer

Many people have similar problems working with UTF-8 text on platforms that have 8-bit system encodings (Windows). Encoding in R can be tricky, because different functions handle encoding and conversion differently, and what works fine on one platform (OS X or Linux) can work poorly on another.

The problem has to do with your output connection and how Windows handles encodings and text connections. I've tried to replicate the problem using some Hebrew texts in both UTF-8 and an 8-bit encoding. We'll walk through the file reading issues as well, since there could be some snags there too.

For Tests

Created a short Hebrew language text file, encoded as UTF-8: hebrew-utf8.txt
Created a short Hebrew language text file, encoded as ISO-8859-8: hebrew-iso-8859-8.txt. (Note: You might need to tell your browser about the encoding in order to view this one properly - that's the case for Safari for instance.)

Ways to read the files

Now let's experiment. I am using Windows 7 for these tests (on OS X, my usual OS, all of this just works).

lines <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt")
lines
## [1] "×”×¢×‘×¨×™ ×”×•× ×—×‘×¨ ×‘×§×‘×•×¦×” ×”×›× ×¢× ×™×ª ×©×œ ×©×¤×•×ª ×©×ž×™×•×ª."                                                                     
## [2] "×–×• ×”×™×ª×” ×©×¤×ª× ×©×œ ×”×™×”×•×“×™× ×ž×•×§×“×, ××‘×œ ×ž×Ÿ 586 ×œ×¤× ×”\"×¡ ×–×” ×”×ª×—×™×œ ×œ×”×™×•×ª ×ž×•×—×œ×£ ×¢×œ ×™×“×™ ×‘××¨×ž×™×ª."

That failed because readLines() assumed the file was in your system encoding, Windows-1252. But because no conversion occurred when the file was read, you can fix this just by setting the Encoding bit to UTF-8:

# this sets the bit for UTF-8
Encoding(lines) <- "UTF-8"
lines
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."                                          
## [2] "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה\"ס זה התחיל להיות מוחלף על ידי בארמית."

But it is better to do this when you read the file:

# this does it in one pass
lines2 <- readLines("http://kenbenoit.net/files/hebrew-utf8.txt", encoding = "UTF-8")
lines2[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
Encoding(lines2)
## [1] "UTF-8" "UTF-8"
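
To make the point about the Encoding bit concrete, here is a small sketch showing that changing the declared encoding never touches the underlying bytes. The string `x` is a made-up example (the Hebrew word "shalom" written with Unicode escapes), not one of the test files above.

```r
# Hypothetical example string: the Hebrew word "shalom", built from Unicode
# escapes so the snippet does not depend on the encoding of this file.
x <- "\u05e9\u05dc\u05d5\u05dd"
Encoding(x)                      # strings built from \u escapes are marked "UTF-8"
bytes_before <- charToRaw(x)     # the underlying UTF-8 bytes

# Change only the declared encoding; this performs no conversion.
y <- x
Encoding(y) <- "latin1"
bytes_after <- charToRaw(y)

identical(bytes_before, bytes_after)   # same bytes, different label
```

This is why setting Encoding() repairs text whose bytes are already UTF-8 but merely mislabelled, and why it cannot repair text whose bytes are in some other encoding.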

Now look at what happens if we try to read the same text, but encoded as the 8-bit ISO Hebrew code page.

lines3 <- readLines("http://kenbenoit.net/files/hebrew-iso-8859-8.txt")
lines3[1]
## [1] "äòáøé äåà çáø á÷áåöä äëðòðéú ùì ùôåú ùîéåú."

Setting the Encoding bit is of no help here, because the bytes that were read are not the UTF-8 representation of the Hebrew code points. Encoding() does no actual encoding conversion; it merely sets an extra bit that tells R which of a few possible encodings the string is in. Passing encoding = "ISO-8859-8" to readLines() would not help either, since that argument only marks strings as Latin-1, UTF-8 or bytes and never re-encodes the input. What does work is converting the text after loading, using iconv():

# this will not fix things
Encoding(lines3) <- "UTF-8"
lines3[1]
## [1] "\xe4\xf2\xe1\xf8\xe9 \xe4\xe5\xe0 \xe7\xe1\xf8 \xe1\xf7\xe1\xe5\xf6\xe4 \xe4\xeb\xf0\xf2\xf0\xe9\xfa \xf9\xec \xf9\xf4\xe5\xfa \xf9\xee\xe9\xe5\xfa."
# but this will
iconv(lines3, "ISO-8859-8", "UTF-8")[1]
## [1] "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
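
The same conversion can be tried without downloading anything. The sketch below builds the ISO-8859-8 bytes for a made-up sample word by hand (an assumption for illustration, not one of the test files) and re-encodes them with iconv().

```r
# ISO-8859-8 bytes for the Hebrew word "shalom" (hypothetical sample data).
iso_bytes <- as.raw(c(0xf9, 0xec, 0xe5, 0xed))
s_iso <- rawToChar(iso_bytes)

# Re-encode: one byte per letter in ISO-8859-8 becomes two bytes per
# letter in UTF-8, and the result is marked as UTF-8.
s_utf8 <- iconv(s_iso, from = "ISO-8859-8", to = "UTF-8")

charToRaw(s_utf8)
Encoding(s_utf8)
```

Unlike Encoding(), iconv() really rewrites the bytes, which is what is needed when the source is not UTF-8 to begin with.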

Overall I think the method used above for lines2 is the best approach.

How to output the files, preserving encoding

Now to your question about how to write this. The safest way is to control your connection at a low level, where you can specify the encoding. Otherwise, R on Windows defaults to your system encoding, which will lose the UTF-8. I thought the following would work; it works absolutely fine on OS X, where calling writeLines() with just a file name, without an explicit connection, also works fine.

## to write lines, use the encoding option of a connection object
f <- file("hebrew-output-UTF-8.txt", open = "wt", encoding = "UTF-8")
writeLines(lines2, f)
close(f)

But it does not work on Windows. You can see the Windows 7 results here: hebrew-output-UTF-8-file_encoding.txt.

So, here is how to do it in Windows: Once you are sure your text is encoded as UTF-8, just write it as raw bytes, without using any encoding, like this:

writeLines(lines2, "hebrew-output-UTF-8-useBytesTRUE.txt", useBytes = TRUE)

You can see the results at hebrew-output-UTF-8-useBytesTRUE.txt, which is now UTF-8 and looks correct.
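
If you want to check the result on your own machine, a round trip along these lines may help. It is only a sketch: the sample string and the temporary file are assumptions made up for the example.

```r
# Hypothetical sample text, forced to UTF-8 before writing.
txt <- enc2utf8("\u05e9\u05dc\u05d5\u05dd")

# Write the bytes as they are, with no re-encoding by the connection.
tmp <- tempfile(fileext = ".txt")
writeLines(txt, tmp, useBytes = TRUE)

# Inspect the raw bytes on disk and read the text back, marked as UTF-8.
bytes_on_disk <- readBin(tmp, what = "raw", n = file.size(tmp))
roundtrip <- readLines(tmp, encoding = "UTF-8")

identical(charToRaw(roundtrip), charToRaw(txt))
```

The file should start with exactly the UTF-8 bytes of the string, followed by the platform's line terminator.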

Added for write.csv

Note that the only reason you would want to do this is to make the .csv file available for import into other software, such as Excel. (And good luck working with UTF-8 in Excel/Windows...) Otherwise, you should just save the data.frame in R's binary format using save(myDataFrame, file = "myDataFrame.RData"). But if you really need to output .csv, then:

How to write UTF-8 .csv files from a `data.frame` in Windows

The problem with writing UTF-8 files using write.table() and write.csv() is that these open text connections, and Windows has limitations about encodings and text connections with respect to UTF-8. (This post offers a helpful explanation.) Following from an SO answer posted here, we can work around this by writing our own function to output UTF-8 .csv files.

This assumes that you have already set the Encoding() for any character elements to "UTF-8" (which happens upon import above for lines2).
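
If you are not sure that every character column is already UTF-8, one possible way to enforce it is sketched below. The helper name `to_utf8` and the `example` data are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: convert every character column of a data.frame to UTF-8.
to_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# Made-up example data: one Hebrew string and one plain ASCII string.
example <- data.frame(id = 1:2,
                      text = c("\u05e9\u05dc\u05d5\u05dd", "abc"),
                      stringsAsFactors = FALSE)
example <- to_utf8(example)

# Non-ASCII strings are marked "UTF-8"; pure ASCII strings stay "unknown",
# which is harmless because ASCII bytes are the same in UTF-8.
Encoding(example$text)
```

Note that enc2utf8() converts from whatever encoding each string is currently declared to be in, so it only helps when that declaration is correct.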

df <- data.frame(int = 1:2, text = lines2, stringsAsFactors = FALSE)

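# Write a data.frame as a quoted, comma-separated UTF-8 text file.
# Note: apply() coerces every column to character, and embedded double quotes are not escaped.
write_utf8_csv <- function(df, file) {
    # header row: quoted column names
    firstline <- paste('"', names(df), '"', sep = "", collapse = " , ")
    # one quoted, comma-joined string per row
    data <- apply(df, 1, function(x) {paste('"', x, '"', sep = "", collapse = " , ")})
    # useBytes = TRUE writes the UTF-8 bytes as they are, with no re-encoding
    writeLines(c(firstline, data), file, useBytes = TRUE)
}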

write_utf8_csv(df, "df_csv.txt")

When we look at that file in a non-Unicode-challenged OS, it looks fine:

KBsMBP15-2:Desktop kbenoit$ cat df_csv.txt 
"int" , "text"
"1" , "העברי הוא חבר בקבוצה הכנענית של שפות שמיות."
"2" , "זו היתה שפתם של היהודים מוקדם, אבל מן 586 לפנה"ס זה התחיל להיות מוחלף על ידי בארמית."
KBsMBP15-2:Desktop kbenoit$ file df_csv.txt 
df_csv.txt: UTF-8 Unicode text, with CRLF line terminators
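
One caveat is visible in the second row above: the double quote inside the Hebrew text is written as-is, so a strict CSV parser may split that field. A possible variant that doubles embedded quotes is sketched here; the function name `write_utf8_csv_escaped` and the sample data are assumptions for illustration, not part of the approach above.

```r
# Hypothetical variant: same idea as write_utf8_csv(), but embedded double
# quotes are doubled (the usual CSV convention) and fields are joined with
# a bare comma.
write_utf8_csv_escaped <- function(df, file) {
  quote_field <- function(x) {
    x <- enc2utf8(as.character(x))
    paste0('"', gsub('"', '""', x, fixed = TRUE), '"')
  }
  header <- paste(quote_field(names(df)), collapse = ",")
  cols <- lapply(df, quote_field)                       # quote column by column
  rows <- do.call(paste, c(unname(cols), sep = ","))    # join columns row-wise
  writeLines(c(header, rows), file, useBytes = TRUE)    # raw UTF-8 bytes out
}

# Made-up sample data containing Hebrew text and an embedded quote.
sample_df <- data.frame(int = 1:2,
                        text = c("\u05e9\u05dc\u05d5\u05dd", "a \"quoted\" word"),
                        stringsAsFactors = FALSE)
out_file <- tempfile(fileext = ".csv")
write_utf8_csv_escaped(sample_df, out_file)

# Read it back, declaring the strings as UTF-8.
back <- read.csv(out_file, encoding = "UTF-8", stringsAsFactors = FALSE)
```

Working column by column instead of with apply() also avoids the padding that as.matrix() can add to numeric columns.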

answered Sep 28 '22 12:09

Ken Benoit
