
Reading special characters like ÆØÅ into R (Rstudio)

I am trying to read a CSV file containing questionnaire data written in Norwegian. The file contains the letters Æ, Ø and Å, but R does not seem to handle them well: they all appear as question marks.

I use this to read the data:

data <- read.csv2("Responser - Vasket - 20.06.2013.csv")

Are there any options I should use to let R know the file contains special characters?

I am using RStudio on Windows 7.

Ole Henrik Skogstrøm asked Jun 24 '13 09:06


3 Answers

You need to specify the fileEncoding argument to read.csv2 (instead of, or possibly in addition to, the encoding argument).
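As a minimal sketch of what such a call could look like: the block below writes a tiny Latin-1 file to a temporary location and reads it back with fileEncoding. The encoding name "latin1" and the temporary file are assumptions made for illustration; for the real file, use whatever encoding your text editor reports.

```r
# Create a small semicolon-separated file encoded as Latin-1 (illustrative only)
tmp <- tempfile(fileext = ".csv")
latin1_lines <- iconv(c("area;site", "\u00f8\u00f8;\u00e5\u00e5"),
                      from = "UTF-8", to = "latin1")
writeLines(latin1_lines, tmp, useBytes = TRUE)

# fileEncoding tells R how the file on disk is encoded, so that it can be
# converted to the session's native encoding while reading.
# "latin1" is an assumption here; it must match the actual file.
dat <- read.csv2(tmp, fileEncoding = "latin1", stringsAsFactors = FALSE)

# For the file in the question, the call would have the same shape:
# data <- read.csv2("Responser - Vasket - 20.06.2013.csv", fileEncoding = "latin1")
```

If the declared encoding does not match the file, R typically warns about invalid input and the data may be cut short, so a warning here is a hint to re-check the encoding.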


Before you get to R, it is a good idea to check the file's encoding in a text editor. For example, if you open the file in Notepad++, the Encoding menu lets you view and change the character encoding. In TextPad, you can change the encoding from the Save As... dialog box. Most text editors have such a feature.

The encoding the editor reports is the value you need to pass to fileEncoding; you can't just declare a file to be UTF-16 if it isn't already. That's why you got a warning.
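If no suitable editor is at hand, a rough check can also be made from R by looking at the first bytes of the file for a byte-order mark (BOM). This is only a sketch: guess_bom is a hypothetical helper, not part of R, and a file without a BOM (which includes ANSI files and many UTF-8 files) simply yields NA, so it does not replace checking in an editor.

```r
# Hypothetical helper: guess an encoding name from a leading byte-order mark.
# Returns NA when no BOM is found, which does NOT tell you the encoding.
guess_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 3L)
  if (length(b) >= 3L && identical(b[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) {
    return("UTF-8-BOM")   # UTF-8 with BOM
  }
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFF, 0xFE)))) {
    return("UTF-16LE")    # little-endian UTF-16
  }
  if (length(b) >= 2L && identical(b[1:2], as.raw(c(0xFE, 0xFF)))) {
    return("UTF-16BE")    # big-endian UTF-16
  }
  NA_character_
}
```

A non-NA result can be passed on as the fileEncoding value, for example read.csv2(path, fileEncoding = guess_bom(path)).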

Richie Cotton answered Nov 20 '22 10:11


Given my R version and settings, this works for me:
In Notepad, I check that the CSV file is saved with 'Encoding: ANSI'.
In RStudio, I set Tools / Options / Default text encoding to ISO8859-1.

I tried with dummy data like this:

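# Build a small data frame containing the Norwegian letters
dd <- data.frame(area = c("øø", "åå", "ææ"), site = c("åå", "ææ", "øø"))
# Write it to a semicolon-separated file, then read it back
write.csv2(x = dd, file = "åæø.csv", row.names = FALSE)
dd2 <- read.csv2(file = "åæø.csv")
# TRUE if the round trip preserved the data
all.equal(dd, dd2)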

sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252  LC_CTYPE=Norwegian (Bokmål)_Norway.1252   
[3] LC_MONETARY=Norwegian (Bokmål)_Norway.1252 LC_NUMERIC=C                              
[5] LC_TIME=Norwegian (Bokmål)_Norway.1252


getOption("encoding")
[1] "native.enc"

Edit following comment from @Ole Henrik Skogstrøm Aug 7 at 7:57

The comment "if i ...use the view command in Rstudio this error still persists" and "if i just type it out and put the result in the console it works" from @Ole Henrik Skogstrøm revealed that the information given in the original post was not sufficient.

My answer above works for the question as originally asked: reading special characters into R. What does not work, and what was not specified in the original post, is that viewing the object in RStudio displays æøå incorrectly. Both when running View(dd) from the console (dd being the dummy data above) and when clicking on the object in the Workspace pane, æøå are displayed as black diamonds with question marks in the data viewer.

On the other hand, if you use the RGui only, without using RStudio, View(dd) displays the characters correctly in the data viewer.

Thus, rather than a problem with reading æøå into R, this seems to be an issue with View-ing them in RStudio. See also this post on RStudio support.
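One way to tell a reading problem from a display problem is to look at the code points stored in the strings rather than at how a viewer renders them. The sketch below uses a small stand-in vector written with Unicode escapes; the variable names are illustrative only.

```r
# Stand-in data written with Unicode escapes, so the source file's own
# encoding does not matter: ø = U+00F8, å = U+00E5, æ = U+00E6
area <- c("\u00f8\u00f8", "\u00e5\u00e5", "\u00e6\u00e6")

# Convert each string to UTF-8 and list its Unicode code points
codes <- lapply(enc2utf8(area), utf8ToInt)

# Correctly read letters give 248 (ø), 229 (å) and 230 (æ).
# A literal "?" would be 63, and the replacement character
# (the black diamond) would be 65533.
codes
```

If the code points are the expected ones, the data were read correctly and any odd glyphs come from the viewer.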

Henrik answered Nov 20 '22 09:11


Hi Henrik, I had the same problem: when CSV files (produced by Excel) containing Ø, Æ and Å were opened in R, the Norwegian letters were displayed as a black diamond with a white question mark in the middle. For me the problem was definitely encoding-related, but I couldn't get either "encoding" or "fileEncoding" to open the files properly.

I solved the problem on my system by opening the CSV in Notepad, then saving it as a text file with the encoding changed from "ANSI" to "UTF-8". See the example below.

The link below contains two CSV files created by Excel: one saved in the MS-DOS format (Names CSV MSDOS) and the other in the "Comma delimited" format (Names CSV).

https://drive.google.com/folderview?id=0BzoGQiFdDwiNNm02UnNLVVNja3c&usp=sharing

Opening them both in Notepad should show that the MS-DOS version represents the letters incorrectly (and so can be ignored), whilst the "Comma delimited" version represents them correctly. Save the "Names CSV" file as a text file with encoding "UTF-8", named "Names CSV UTF8". Set your working directory in R to the folder where the files are located and run the following code.

test1 <- read.csv2("Names CSV.csv")
test2 <- read.csv2("Names CSV UTF8.txt")

test1 should display the black diamond with question mark; test2 should display the names correctly.

I think the previous answers may not have worked because their test table was created by R, with R setting the character encoding, whereas in the problem I had (and, I believe, you have as well) different software or a different system set the encoding.

This solution isn't very useful if you have lots of files to work on, but it is at least a start.
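For many files, the manual Notepad step could in principle be scripted in R. The following is a sketch under stated assumptions, not a tested recipe for the files above: convert_to_utf8 and convert_dir_to_utf8 are hypothetical helpers, and the source encoding "latin1" is an assumption that stands in for Notepad's "ANSI" on a Western European Windows system.

```r
# Hypothetical helper: re-encode one text file to UTF-8.
# 'from' is an assumption and must match the file's real encoding.
convert_to_utf8 <- function(path, out_path, from = "latin1") {
  lines <- readLines(path, warn = FALSE)          # read the bytes as they are
  utf8  <- iconv(lines, from = from, to = "UTF-8")  # convert to UTF-8
  writeLines(utf8, out_path, useBytes = TRUE)     # write without re-encoding
  invisible(out_path)
}

# Hypothetical helper: convert every .csv file in a folder, writing
# "<name>_utf8.csv" next to each original. Running it twice would also
# convert the files it produced earlier, so use a clean folder.
convert_dir_to_utf8 <- function(dir, from = "latin1") {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  for (f in files) {
    convert_to_utf8(f, sub("\\.csv$", "_utf8.csv", f), from = from)
  }
  invisible(files)
}
```

The converted files can then be read in the same way as "Names CSV UTF8.txt" above, optionally with fileEncoding = "UTF-8" to state the encoding explicitly.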

Jonno Bourne answered Nov 20 '22 08:11