
How to read data in UTF-8 format in R?

Tags:

r

utf-8

My system: Windows 7, R 3.0.2.

> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"

There are two files with the same content, both saved from Microsoft Notepad: one in ANSI format, the other in UTF-8 format. The data is the list of passenger names from Malaysia Airlines flight MH370. You can create the files this way.

1) Copy the data into Microsoft Notepad.

乘客姓名,性别,出生日期
HuangTianhui,男,1948/05/28
姜翠云,女,1952/03/27
李红晶,女,1994/12/09

2) Save it as test.ansi with ANSI encoding in Notepad.
3) Save it as test.utf8 with UTF-8 encoding in Notepad.

read.table("test.ansi",sep=",",header=TRUE)  # works fine
read.table("test.utf8",sep=",",header=TRUE)  # does not work

Then I set the encoding to UTF-8.

options(encoding="utf-8")
read.table("test.utf8",sep=",",header=TRUE,encoding="utf-8")


 In read.table("test.utf8", sep = ",",header=TRUE,encoding = "utf-8") :
invalid input found on input connection 'test.utf8'

How can I read the data file (test.utf8)?
In Python, it is simple:

rfile=open("g:\\test.utf8","r",encoding="utf-8").read()
rfile
'\ufeff乘客姓名,性别,出生日期\n\nHuangTianhui,男,1948/05/28\n\n姜翠云,女,1952/03/27\n\n李红晶,女,1994/12/09'
rfile.replace("\n\n","\n").replace("\ufeff","").splitlines()
['乘客姓名,性别,出生日期', 'HuangTianhui,男,1948/05/28', '姜翠云,女,1952/03/27', '李红晶,女,1994/12/09']

Python does this job better than R.

I did as Sathish suggested. That solved part of the problem, but some issues remain.
I found that when the data is printed as a whole data.frame, it is not displayed properly;
when it is a single column of the data.frame, it is displayed properly;
strangely enough, when it is a single row of the data.frame, it is not displayed properly.


asked Apr 05 '14 04:04

showkey

1 Answer

OS: Windows-7 (64-bit)

R Version:

package_version(R.version)

[1] ‘3.0.2’

Change your locale from "chinese" to "English_United States.1252"

  Sys.setlocale(category="LC_ALL", locale = "English_United States.1252")

  Sys.getlocale(category="LC_ALL")

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Read in data with chinese encoding

 df_ch <- read.table("test.utf8",
                     sep=",",
                     header=FALSE, 
                     encoding="chinese", 
                     stringsAsFactors=FALSE
                    )

Read in data with UTF-8 encoding

 df_utf8 <- read.table("test.utf8",
                        sep=",",
                        header=FALSE, 
                        encoding="UTF-8", 
                        stringsAsFactors=FALSE 
                      )
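
Note that the `encoding` argument of `read.table()` only declares how the strings that are read should be marked (as Latin-1 or UTF-8). It does not re-encode the file, and it does not remove the byte-order mark (BOM) that Notepad writes at the start of a UTF-8 file. Re-encoding is the job of `fileEncoding`, and `?file` documents `"UTF-8-BOM"` as an encoding name that discards the BOM, although that route converts to the native encoding and so depends on the locale. What follows is a minimal, locale-independent sketch, not the method used in this answer: it reads the lines as UTF-8, strips a leading BOM if there is one, and parses the text. The helper name `read_utf8_table` and the temporary sample file are made up for illustration.

```r
# Hypothetical helper: read a comma-separated UTF-8 file that may start with a BOM.
read_utf8_table <- function(path, ...) {
  # Read the lines and mark them as UTF-8 (no re-encoding takes place).
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Drop a leading byte-order mark (U+FEFF) if one is still present.
  lines[1] <- sub("^\uFEFF", "", lines[1])
  # Parse the cleaned text; check.names = FALSE keeps the Chinese header as-is.
  read.table(text = lines, sep = ",", header = TRUE,
             stringsAsFactors = FALSE, check.names = FALSE, ...)
}

# Build a small sample file the way Notepad would save it: BOM + UTF-8 text.
# The \u escapes keep this script independent of the editor's own encoding.
sample_lines <- c(
  "\u4e58\u5ba2\u59d3\u540d,\u6027\u522b,\u51fa\u751f\u65e5\u671f",
  "HuangTianhui,\u7537,1948/05/28",
  "\u59dc\u7fe0\u4e91,\u5973,1952/03/27"
)
path <- tempfile(fileext = ".utf8")
con <- file(path, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the UTF-8 BOM
writeBin(charToRaw(enc2utf8(paste0(sample_lines, "\n", collapse = ""))), con)
close(con)

df <- read_utf8_table(path)
str(df)
```

Whether the result then prints as Chinese characters or as `<U+XXXX>` escapes still depends on the console and the locale; the sketch only addresses getting the data in correctly.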

In RStudio Version 0.98.501

 df_ch$V1[1]

[1] "ï»¿ä¹˜å®¢å§“å"

 df_utf8$V1[1]

[1] "乘客姓名"

 df_utf8$V1

[1] "乘客姓名"         "HuangTianhui"    "姜翠云"          "李红晶"          "LuiChing"        "宋飞飞"         
[7] "唐旭东"          "YangJiabao"      "买买提江·阿布拉" "安文兰"          "鲍媛华"          "边亮京"         
[13] "边茂勤"          "曹蕊"            "车俊章"          "陈长军"          "陈建设"          "陈昀"           
[19] "戴淑玲"          "丁立军"          "丁莹"            "丁颖"            "董国伟"          "杜文忠"         
[25] "冯栋"            "冯纪新"          "付宝峰"          "甘福祥"          "甘涛"            "高歌"           
[31] "管文杰"          "韩静"            "侯爱琴"          "侯波"            "胡偲婠(婴儿)"    "胡效宁"

Display unicode data for a row from a data frame

  df_utf8[1,]
                                    V1               V2                               V3
  1 <U+FEFF><U+4E58><U+5BA2><U+59D3><U+540D> <U+6027><U+522B> <U+51FA><U+751F><U+65E5><U+671F>
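
The `<U+FEFF>` at the start of the first cell is the UTF-8 byte-order mark (BOM) from the beginning of the file, read in as if it were data. One way to check whether a file carries it is to look at its first three bytes. The sketch below uses a hypothetical helper name, `has_utf8_bom`, and temporary files created only for the demonstration.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Two throwaway files: one with a BOM, one without.
with_bom <- tempfile()
without_bom <- tempfile()
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("a,b\n")), with_bom)
writeBin(charToRaw("a,b\n"), without_bom)

has_utf8_bom(with_bom)     # TRUE
has_utf8_bom(without_bom)  # FALSE
```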

Display chinese data for a row from a data frame

as.character(df_utf8[1,])

[1] "乘客姓名"  "性别"     "出生日期"

as.character(df_utf8[2,])

[1] "HuangTianhui" "男"           "1948/05/28"
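
`as.character()` on a one-row data frame works here because every column is a character vector (the data was read with `stringsAsFactors=FALSE`); with factor columns it would return the integer codes instead. A slightly more defensive sketch, using a hypothetical helper `row_as_character`, converts each cell separately:

```r
# Hypothetical helper: return row i of a data frame as a plain character vector.
row_as_character <- function(df, i) {
  # Each column of the one-row data frame is converted on its own,
  # so factor columns yield their labels rather than their integer codes.
  vapply(df[i, , drop = FALSE], as.character, character(1), USE.NAMES = FALSE)
}

# Small stand-in for df_utf8; V2 is deliberately a factor.
df <- data.frame(
  V1 = c("\u4e58\u5ba2\u59d3\u540d", "HuangTianhui"),
  V2 = factor(c("\u6027\u522b", "\u7537")),
  V3 = c("\u51fa\u751f\u65e5\u671f", "1948/05/28"),
  stringsAsFactors = FALSE
)
row_as_character(df, 2)
```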

To display multiple columns of data containing international characters, convert the data frame into a list and coerce each column to character:

  df_utf8_ch <- lapply(df_utf8, as.character)

  df_utf8_ch

$V1
 [1] "乘客姓名" "HuangTianhui" "姜翠云" "李红晶" "LuiChing" "宋飞飞"
 [7] "唐旭东" "YangJiabao" "买买提江·阿布拉" "安文兰" "鲍媛华" "边亮京"
[13] "边茂勤" "曹蕊" "车俊章" "陈长军" "陈建设" "陈昀"
[19] "戴淑玲" "丁立军" "丁莹" "丁颖" "董国伟" "杜文忠"
[25] "冯栋" "冯纪新" "付宝峰" "甘福祥" "甘涛" "高歌"
[31] "管文杰" "韩静" "侯爱琴" "侯波" "胡偲婠(婴儿)" "胡效宁"
[37] "黄毅" "姜学仁" "姜颖" "焦微微" "焦文学" "鞠坤"
[43] "康旭" "黎明中" "李国辉" "李洁" "李乐" "李文博"
[49] "李燕" "李宇辰" "李志锦" "李志欣" "李智" "栗延林"
[55] "梁路阳" "梁旭阳" "林安南" "林明峰" "刘凤英" "刘金鹏"
[61] "刘强" "刘如生" "刘顺超" "柳忠福" "楼宝棠" "卢先初"
[67] "鹿建华" "罗伟" "马骏" "马文芝" "毛土贵" "么立飞"
[73] "蒙高生" "孟兵" "孟凡余" "欧阳欣" "石贤文" "宋春玲"
[79] "宋坤" "苏强国" "汤雪竹" "田军伟" "田清君" "汪厚彬"
[85] "王春勇" "王纯华" "王丹" "王海涛" "王利军" "王林诗"
[91] "王墨恒(婴儿)" "王守宪" "王淑敏" "王献军" "王永刚"

$V2
 [1] "性别" "男" "女" "女" "女" "男" "男" "女" "男" "女" "女" "男" "女" "女" "女" "男"
[17] "男" "女" "女" "男" "女" "女" "男" "男" "男" "男" "男" "男" "男" "女" "男" "女"
[33] "女" "男" "女" "男" "女" "男" "女" "女" "男" "男" "男" "男" "男" "女" "男" "女"
[49] "女" "男" "男" "男" "男" "男" "男" "男" "男" "男" "女" "男" "男" "男" "男" "男"
[65] "男" "男" "男" "男" "男" "女" "男" "男" "男" "男" "男" "女" "男"

$V3
 [1] "出生日期" "1948/05/28" "1952/03/27" "1994/12/09" "1969/08/02" "1982/03/01" "1983/08/03" "1988/08/25"
 [9] "1979/07/10" "1949/10/20" "1951/10/21" "1987/06/06" "1947/07/19" "1982/02/19" "1946/03/20" "1979/06/06"
[17] "1956/03/07" "1957/08/11" "1956/12/07" "1971/04/06" "1952/04/25" "1986/10/24" "1966/10/26" "1964/06/07"
[25] "1993/03/09" "1944/01/06" "1986/12/06" "1965/11/21" "1970/01/29" "1987/11/16" "1979/10/03" "1961/05/28"
[33] "1969/06/24" "1979/05/15" "2011/02/25" "1980/01/01" "1984/06/18" "有待确认" "1987/04/13" "1983/05/09"
[41] "1956/12/17" "1982/11/07" "1980/08/09" "1945/12/19" "1958/05/18" "1987/02/06" "1982/12/03" "1985/07/16"
[49] "1983/07/19" "1987/11/06" "1984/04/14" "1979/05/22" "1973/05/05" "1985/10/26" "1954/03/26" "1984/11/12"
[57] "1987/03/27" "1980/05/25" "1949/05/10" "1981/12/26" "1974/08/13" "1938/01/22" "1968/02/29" "1942/05/22"
[65] "1935/04/21" "1981/10/14" "1957/03/28" "1985/08/20" "1981/12/25" "1957/08/01" "1942/08/02" "1983/06/15"
[73] "1950/01/01" "1974/04/26" "1944/08/23" "1976/10/12" "1988/01/18" "1954/04/06"

 View(df_ch)

[Screenshot: chinese encoding]

 View(df_utf8)

In RGui (64-bit)

View(df_ch)

View(df_utf8)

The good thing is that all the data is now in UTF-8 format and can be used for further analysis.

Once your analysis is done, you may change the locale back to "chinese":

  Sys.setlocale(category="LC_ALL", locale = "chinese")

  Sys.getlocale(category="LC_ALL")

 [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
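
Switching the locale by hand makes it easy to forget to switch it back. As a sketch only, the change can be scoped to a single expression. The helper name `with_ctype` is hypothetical, and it changes just `LC_CTYPE` rather than `LC_ALL`, on the assumption that only character handling matters here.

```r
# Hypothetical helper: evaluate `expr` under a temporary LC_CTYPE setting.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous setting when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  force(expr)  # the expression is evaluated here, under the new setting
}

# "C" is used for the demonstration because it exists on every platform;
# on Windows one would pass a name such as "English_United States.1252".
before <- Sys.getlocale("LC_CTYPE")
inside <- with_ctype("C", Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")
```

If the requested locale is not available, `Sys.setlocale()` returns an empty string with a warning and leaves the setting unchanged, so the expression then runs under the original locale.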

Some functions you may want to explore for converting between character string encodings:

Encoding()

iconv()
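
As a brief illustration of the difference between the two (a sketch, not part of the original analysis): `Encoding()` only reads or changes the label attached to a string and leaves its bytes alone, whereas `iconv()` rewrites the bytes. UTF-8 bytes interpreted under the wrong single-byte encoding are most likely what produced the garbled `df_ch` output shown earlier.

```r
x <- "\u4e58\u5ba2"   # two characters, marked as UTF-8 by the \u escapes
Encoding(x)           # "UTF-8"
nchar(x)              # 2

# Relabel the same bytes as Latin-1: nothing is converted, only the label changes.
y <- x
Encoding(y) <- "latin1"
identical(charToRaw(x), charToRaw(y))  # TRUE

# iconv() converts for real. Starting from the mislabelled string, each of the
# six bytes is treated as one Latin-1 character and re-encoded into UTF-8.
z <- iconv(y, from = "latin1", to = "UTF-8")
nchar(z)                               # 6
identical(charToRaw(z), charToRaw(x))  # FALSE
```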

HTH

answered Sep 22 '22 21:09

Sathish
