I'm trying to read the following UTF-8 encoded file in R, but whenever I read it, the Unicode characters are not decoded correctly:
The script I'm using to process the file is as follows:
defaultEncoding <- "UTF8"

detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                      "factor", "factor", "factor", "factor", "factor", "numeric",
                                      "numeric", "numeric", "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric", "numeric", "numeric",
                                      "numeric", "character", "character")

readDetalheVotacaoMunicipioZona <- function(fileName) {
  fileConnection <- file(fileName, encoding = defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)
  close(fileConnection)
  contents <- gsub('"', "", contents)
  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos",
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda",
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")
  read.csv(text = contents,
           colClasses = detalheVotacaoMunicipioZonaTypes,
           sep = ";",
           col.names = columnNames,
           fileEncoding = defaultEncoding,
           header = FALSE)
}
I read the file specifying the UTF-8 encoding, remove all quotes (even numbers are quoted, so I need to clean them up) and then feed the contents to read.csv. It reads and processes the file correctly, but it seems like it's not using the encoding information I'm giving it.
What should I do to make it use UTF-8 to read this file?
I'm using RStudio on OSX if it makes any difference.
Setting the Default Encoding
If you don't set a default encoding, files will be opened using UTF-8 (on Mac desktop, Linux desktop, and server) or the system's default encoding (on Windows). When saving a previously unsaved file, RStudio will ask you to choose an encoding if non-ASCII characters are present.
Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes". These declarations can be read by Encoding, which will return a character vector of values "latin1", "UTF-8", "bytes" or "unknown", or set, when value is recycled as needed and other values are silently treated as "unknown".
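To make the declaration mechanism concrete, here is a minimal sketch using only base R. The sample word and the variable names are illustrative; the bytes are built by hand so the example does not depend on the current locale.

```r
# "ELEIÇÃO" as Latin-1 bytes: 0xC7 is "Ç" and 0xC3 is "Ã" in Latin-1.
latin1_bytes <- c(charToRaw("ELEI"), as.raw(c(0xC7, 0xC3)), charToRaw("O"))

# rawToChar() produces a string with no declared encoding.
word <- rawToChar(latin1_bytes)
before <- Encoding(word)

# Declare what the bytes mean; this changes the mark, not the bytes.
Encoding(word) <- "latin1"
after <- Encoding(word)

# enc2utf8() converts the declared Latin-1 string to real UTF-8 bytes.
word_utf8 <- enc2utf8(word)

cat("before:", before, "| after:", after, "| converted:", Encoding(word_utf8), "\n")
```

Declaring the wrong encoding does not raise an error; it simply makes R interpret the bytes incorrectly, which is one common way garbled accented characters arise.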
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things.
The encoding standard that is saved with a text file provides the information that your computer needs to display the text on the screen. For example, in the Cyrillic (Windows) encoding, the character Й has the numeric value 201.
This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:
If the problem only happens in RStudio and not in command-line R, go to RStudio -> Preferences: General, tell us what 'Default text encoding:' is set to, click 'Change' and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or else 'Ask' if you always want to be prompted). Screenshot attached at bottom. Let us know which one worked!
If the problem also happens in command-line R, do the following:
Run locale -m on your Mac and tell us whether it supports CP1252 or else ISO8859-1 ('latin1'). Dump the list of supported locales if you need to. (You might as well tell us your version of macOS while you're at it.)
For both of those locales, try to change to that locale:
# First try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Don't omit the "LC_ALL" first argument, or the call will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # The name might need to be 'CP1252'.
# Next try ISO8859-1 ('latin1'); this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")
# Try "pt_PT.UTF-8" too...
# In your program, make sure Sys.setlocale() worked: put this assertion in your code before calling read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")
That should work.
Strictly, the Sys.setlocale() command should go in your ~/.Rprofile for startup, not inside your R session or source code.
However, Sys.setlocale() can fail, so just be aware of that. Also, assert Sys.getlocale() inside your setup code early and often, as I do. (Really, read.csv should figure out if the encoding it uses is compatible with the locale, and warn or error if not.)
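Since Sys.setlocale() can fail, one way to handle that is to try several candidate names and check the return value. The following is only a sketch: set_first_available_locale is a hypothetical helper, not part of base R, and the candidate names are assumptions that may not exist on your system.

```r
# Hypothetical helper (not part of base R): try each candidate locale in turn
# and return the first one the operating system accepts, or NA if none is.
set_first_available_locale <- function(candidates, category = "LC_CTYPE") {
  for (candidate in candidates) {
    # Sys.setlocale() returns "" (with a warning) when the request is refused.
    result <- suppressWarnings(Sys.setlocale(category, candidate))
    if (nzchar(result)) {
      return(result)
    }
  }
  NA_character_
}

# Candidate names are illustrative; adjust them to what your system supports.
chosen <- set_first_available_locale(c("pt_PT.UTF-8", "pt_BR.UTF-8", "en_US.UTF-8"))
if (is.na(chosen)) {
  message("None of the candidate locales could be set.")
} else {
  message("LC_CTYPE is now: ", chosen)
}
```

Returning NA instead of stopping lets the caller decide whether a missing locale is fatal; a stricter setup script could wrap the call in stopifnot(!is.na(chosen)).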
Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhancement.
It works fine for me.
Did you try to change or reset the locale?
In my case it works with:
Sys.setlocale(category = "LC_ALL", locale = "Portuguese_Portugal.1252")
d <- read.table(text=readClipboard(), header=TRUE, sep = ';')
head(d)
1 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 20419 20419 ITAPORANGA 33 13 VEREADOR 17157
2 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 20770 20770 MALTA 51 11 PREFEITO 4677
3 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 21091 21091 OLHO D'ÁGUA 32 13 VEREADOR 6653
4 25/04/2014 22:29:30 2012 1 ELEIÇÃO MUNICIPAL 2012 PB 21113 21113 OLIVEDOS 23 13 VEREADOR 3243
...
I had the same problem with the Portuguese locale in R (macOS 10.12.3).
I tried the suggestions in the thread above and none of them worked. Then I found this webpage: https://docs.moodle.org/dev/Table_of_locales
and just tried Sys.setlocale(category = "LC_ALL", locale = "pt_PT.UTF-8")
and it works.
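Separately from the locale, it may help to let read.csv open the file itself instead of passing pre-read text. The sketch below rests on several assumptions: the two sample rows are made up (modeled on the output shown earlier in this thread), the six column names are an illustrative subset, and the file is assumed to really be UTF-8. Here encoding = "UTF-8" marks the strings as UTF-8 without converting them, whereas fileEncoding would re-encode the input into the native encoding. colClasses is left unset because, per the read.table documentation, quotes are only interpreted for columns read as character.

```r
# Made-up two-row sample modeled on the output shown in this thread;
# the column subset and names are illustrative only.
sample_lines <- c(
  '"25/04/2014";"22:29:30";"2012";"ELEI\u00c7\u00c3O MUNICIPAL 2012";"ITAPORANGA";"17157"',
  '"25/04/2014";"22:29:30";"2012";"ELEI\u00c7\u00c3O MUNICIPAL 2012";"MALTA";"4677"'
)

# Write the sample as UTF-8 bytes, regardless of the current locale.
sample_file <- tempfile(fileext = ".txt")
writeLines(enc2utf8(sample_lines), sample_file, useBytes = TRUE)

# Let read.csv strip the quotes and infer the numeric columns itself.
votes <- read.csv(sample_file,
                  sep = ";",
                  header = FALSE,
                  quote = "\"",
                  col.names = c("data_geracao", "hora_geracao", "ano_eleicao",
                                "descricao_eleicao", "nome_municipio", "qtd_aptos"),
                  encoding = "UTF-8",
                  stringsAsFactors = FALSE)

print(votes$descricao_eleicao)
unlink(sample_file)
```

Columns that should be factors can be converted afterwards with factor(), which avoids stripping the quotes by hand before parsing.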