
Displaying UTF-8 encoded Chinese characters in R

Tags: r, utf-8, locale

I am trying to open a UTF-8 encoded .csv file that contains (traditional) Chinese characters in R. For some reason, R sometimes displays the contents as Chinese characters and sometimes as Unicode escape codes.

For instance:

data <- read.csv("mydata.csv", encoding="UTF-8")

data

will print Unicode escape codes instead of the characters themselves, while:

data <- read.csv("mydata.csv", encoding="UTF-8")

data[,1]

will actually display Chinese characters.

If I convert the data frame to a matrix, it also displays Chinese characters, but if I inspect the data with View(data) or fix(data), I get Unicode escape codes again.

I've asked for advice from people who use a Mac (I'm on a PC running Windows 7); some of them got Chinese characters throughout, others didn't. I tried saving the original data as a table instead and reading that into R, with the same result. I tried running the script in RStudio, Revolution R, and RGui. I also tried to adjust the locale (e.g. to Chinese), but either R would not let me change it, or the result was gibberish instead of Unicode escape codes.

My current locale is:

"LC_COLLATE=French_Switzerland.1252;LC_CTYPE=French_Switzerland.1252;LC_MONETARY=French_Switzerland.1252;LC_NUMERIC=C;LC_TIME=French_Switzerland.1252"
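For readers who want to reproduce the locale check and the attempted change described above, here is a minimal sketch. It assumes the Windows-style locale name "Chinese", which is an assumption not taken from the question and is typically not recognised on other platforms; the variable names old_ctype and new_ctype are purely illustrative.

```r
# Print the full locale string (the form quoted in the question).
print(Sys.getlocale())

# Remember the current character-type locale so it can be restored later.
old_ctype <- Sys.getlocale("LC_CTYPE")

# Try to switch LC_CTYPE. "Chinese" is an assumed Windows-style locale name.
# Sys.setlocale() returns "" (and warns) when the request cannot be honoured.
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "Chinese"))

if (identical(new_ctype, "")) {
  message("This platform did not accept the requested locale.")
} else {
  message("LC_CTYPE is now: ", new_ctype)
}

# Restore the original setting so the rest of the session is unaffected.
invisible(Sys.setlocale("LC_CTYPE", old_ctype))
```

Checking the return value is a simple way to tell whether R accepted the change rather than silently ignoring it.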

Any help to get R to consistently display Chinese characters would be greatly appreciated...

asked Jun 08 '12 20:06 by user1445297


1 Answer

This is not a bug. It is more a misunderstanding of the type conversion (character type versus factor type) that takes place when a data.frame is constructed.

You could start with data <- read.csv("mydata.csv", encoding="UTF-8", stringsAsFactors=FALSE), which keeps the columns containing Chinese text as character vectors instead of converting them to factors, so printing them should show what you expect.

@nograpes: similarly, x <- c('中華民族'); x; y <- data.frame(x, stringsAsFactors=FALSE) and everything should be OK.
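The suggestion above can be tried without the original file. The sketch below is only an illustration: it writes a small UTF-8 CSV to a temporary file as a hypothetical stand-in for mydata.csv (the column names name and value and the sample rows are made up), then reads it back with stringsAsFactors=FALSE. Note that since R 4.0.0 stringsAsFactors already defaults to FALSE, and that how a whole data frame prints may still depend on the platform and locale.

```r
# Hypothetical sample data; \u escapes keep this script itself ASCII-only.
lines <- c("name,value",
           "\u4e2d\u83ef\u6c11\u65cf,1",
           "\u53f0\u5317,2")

# Write the UTF-8 bytes unchanged to a temporary file (stand-in for mydata.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(lines), tmp, useBytes = TRUE)

# Read it back, declaring the encoding and keeping strings as character.
data <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# The text column is a character vector, not a factor.
print(class(data$name))

# Printing the column on its own, as with data[,1] in the question.
print(data$name)

# Printing the whole data frame; the display may differ by locale.
print(data)
```

Comparing the last two print calls on your own machine shows whether the column and the data frame are displayed the same way in your setup.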

answered Oct 24 '22 18:10 by jcb