
UTF-8 / Unicode Text Encoding with RPostgreSQL

I'm running R on a Windows machine which is directly linked to a PostgreSQL database. I'm not using RODBC. My database is encoded in UTF-8 as confirmed by the following R command:

    dbGetQuery(con, "SHOW CLIENT_ENCODING")
    #   client_encoding
    # 1            UTF8

However, when some text is read into R, it displays as strange text in R.

For example, the following text is shown in my PostgreSQL database: "Stéphane"

After exporting to R it's shown as: "StÃ©phane" (the é is displayed as Ã©)
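This kind of garbling is typical of UTF-8 bytes being read as Windows-1252: the two bytes of é (0xC3 0xA9) each turn into a separate character. A minimal base R sketch of the effect, assuming the converter name `CP1252` is available to `iconv` on the platform (the variable names are only for illustration):

```r
# UTF-8 bytes of "Stéphane"; the é is the two-byte sequence C3 A9.
utf8_bytes <- as.raw(c(0x53, 0x74, 0xc3, 0xa9, 0x70, 0x68, 0x61, 0x6e, 0x65))

# rawToChar() leaves the encoding undeclared ("unknown"), much like text
# that arrives from a driver without an encoding mark.
s <- rawToChar(utf8_bytes)
Encoding(s)

# Treat those bytes as Windows-1252 and convert them to UTF-8:
# each of the two bytes of é becomes its own character (Ã and ©).
garbled <- iconv(s, from = "CP1252", to = "UTF-8")
nchar(garbled, type = "chars")  # one character more than the original name
charToRaw(garbled)
```

The data coming from the database is fine in this scenario; only the interpretation of the bytes on the R side is off.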

When importing to R I use the dbConnect command to establish a connection and the dbGetQuery command to query data using SQL. I do not specify any text encoding anywhere when connecting to the database or when running a query.

I've searched online and can't find a direct resolution to my issue. I found this link, but their issue is with RODBC, which I'm not using.

This link is helpful in identifying the symbols, but I don't just want to do a find & replace in R... way too much data.

I tried running the following commands and got a warning:

    Sys.setlocale("LC_ALL", "en_US.UTF-8")
    # [1] ""
    # Warning message:
    # In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
    #   OS reports request to set locale to "en_US.UTF-8" cannot be honored
    Sys.setenv(LANG="en_US.UTF-8")
    Sys.setenv(LC_CTYPE="UTF-8")

The warning occurs on the Sys.setlocale("LC_ALL", "en_US.UTF-8") command. My intuition is that this is a Windows specific issue and doesn't occur with Mac/Linux/Unix.

Asked by David L on Jan 27 '14 at 22:01




1 Answer

As Craig Ringer said, setting client_encoding to windows-1252 is probably not the best thing to do. Indeed, if the data you're retrieving contains a single exotic character, you're in trouble:

    Error in postgresqlExecStatement(conn, statement, ...) :
      RS-DBI driver: (could not Retrieve the result : ERROR: character 0xcca7 of encoding "UTF8" has no equivalent in "WIN1252"
    )
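The bytes 0xCC 0xA7 in that message are the UTF-8 form of U+0327 (combining cedilla), which Windows-1252 has no slot for. The same limitation can be seen from base R without a database. This is only a sketch: it assumes `iconv` knows the name `CP1252`, and what happens for an unconvertible character depends on the platform's iconv implementation.

```r
# é (U+00E9, UTF-8 bytes C3 A9) exists in Windows-1252, so it converts
# to the single byte E9.
e_acute <- rawToChar(as.raw(c(0xc3, 0xa9)))
ok <- iconv(e_acute, from = "UTF-8", to = "CP1252")
charToRaw(ok)

# "c" followed by a combining cedilla (UTF-8 bytes CC A7) cannot be
# represented in Windows-1252. On many platforms iconv() gives NA here;
# others may substitute a placeholder character instead.
c_cedilla <- rawToChar(as.raw(c(0x63, 0xcc, 0xa7)))
lossy <- iconv(c_cedilla, from = "UTF-8", to = "CP1252")
lossy
```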

On the other hand, getting your R environment to use Unicode could be impossible (I have the same problem as you with Sys.setlocale... Same in this question too.).

A workaround is to manually declare UTF-8 encoding on all your data, using a function like this one:

    set_utf8 <- function(x) {
      # Declare UTF-8 encoding on all character columns:
      chr <- sapply(x, is.character)
      x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
      # Same on column names:
      Encoding(names(x)) <- "UTF-8"
      x
    }

And you have to use this function in all your queries:

set_utf8(dbGetQuery(con, "SELECT myvar FROM mytable")) 
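To see what the function does without a database, here is a small self-contained sketch. The data frame `df` is a made-up stand-in for a `dbGetQuery` result, built from raw UTF-8 bytes so that its encoding starts out undeclared; `set_utf8` is repeated so the snippet runs on its own.

```r
set_utf8 <- function(x) {
  # Declare UTF-8 encoding on all character columns:
  chr <- sapply(x, is.character)
  x[, chr] <- lapply(x[, chr, drop = FALSE], `Encoding<-`, "UTF-8")
  # Same on column names:
  Encoding(names(x)) <- "UTF-8"
  x
}

# Stand-in for a query result: UTF-8 bytes of "Stéphane", encoding undeclared.
utf8_bytes <- as.raw(c(0x53, 0x74, 0xc3, 0xa9, 0x70, 0x68, 0x61, 0x6e, 0x65))
df <- data.frame(id = 1L, name = rawToChar(utf8_bytes), stringsAsFactors = FALSE)
Encoding(df$name)

fixed <- set_utf8(df)
Encoding(fixed$name)

# Declaring the encoding does not convert anything: the bytes are unchanged,
# R is only told how to interpret them.
identical(charToRaw(fixed$name), utf8_bytes)
```

Because nothing is converted, this is only appropriate when the bytes really are UTF-8, which is the case here since client_encoding is UTF8.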

EDIT: Another possibility is to use RPostgres instead of RPostgreSQL. I tested it (with the same config as in your question), and as far as I can see all declared encodings are automatically set to UTF-8.
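A rough sketch of what switching could look like, assuming the DBI and RPostgres packages are installed. The helper name `fetch_with_rpostgres` and all of its arguments are placeholders for illustration, not values from the question:

```r
# Hypothetical helper: connect through RPostgres, run one query, disconnect.
# All arguments are placeholders to be replaced with real connection details.
fetch_with_rpostgres <- function(sql, dbname, host, user, password, port = 5432) {
  con <- DBI::dbConnect(
    RPostgres::Postgres(),
    dbname = dbname, host = host, port = port,
    user = user, password = password
  )
  # Close the connection when the function exits, even on error.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbGetQuery(con, sql)
}

# Usage (not run here, needs a reachable database):
# res <- fetch_with_rpostgres("SELECT myvar FROM mytable",
#                             dbname = "mydb", host = "localhost",
#                             user = "me", password = "secret")
# Encoding(res$myvar)  # inspect the declared encoding of the column
```

Checking `Encoding()` on a column with accented text is a quick way to confirm whether the driver marks strings as UTF-8 in your setup.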

Answered by Scarabee on Sep 28 '22 at 04:09