
RODBC character encoding error with PostgreSQL

I'm getting an error I've never seen before when connecting from R to a Greenplum (PostgreSQL) database using RODBC. It occurs in both Emacs/ESS and RStudio, and the same RODBC call has worked unchanged in the past.

library(RODBC)
gp <- odbcConnect("greenplum", believeNRows = FALSE)
data <- sqlQuery(gp, "select * from mytable")

> data
[1] "22P05 7 ERROR: character 0xc280 of encoding \"UTF8\" has no equivalent in \"WIN1252\";\nError while executing the query"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'select * from mytable'"

EDIT: Just tried querying another table and did get results. So I guess it's not an RODBC problem but a PostgreSQL table encoding problem.

R version 2.13.0 (2011-04-13)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RODBC_1.3-2
> 
wahalulu asked Aug 23 '11 20:08


2 Answers

First, the issue arises because R is trying to convert to a Windows locale that supports UTF8. Unfortunately, Brian Ripley has reported numerous times that Windows has no UTF8 locales. After hours spent searching the web, StackOverflow, Microsoft, etc., I have come to the conclusion that Windows won't support UTF8.

As a result, I'm not sure that there's an easy solution to this, if there is any solution at all. The best I can recommend is to wrap some kind of conversion on the server side, look at filtering the data if you can, or try a different language, if appropriate (e.g. Chinese, Japanese, Korean).

If you do decide to wrap a converter, unicode.org recommends this ICU toolkit.

Iterator answered Oct 05 '22 23:10


0xc280 is the UTF-8 encoding of a control character (U+0080 in Unicode) that causes trouble pretty often when working with SQL and the like. The problem usually lies in the chain of conversions that inevitably happens when you combine applications that use different encoding schemes. Windows includes UTF-8 support by now, so it's not strictly a Windows problem. I believe the problem arises before R reads the data in.

In fact, somewhere in the chain the code point 0x80 in Unicode gets mapped to 0xc280 in UTF-8. This is supposed to be a control character and cannot be printed. But chances are good that the 0x80 is in fact not Unicode, but Windows Latin-1 or Latin-2 (Windows-1252 or Windows-1250). In that case, the 0x80 represents the euro sign, which might explain how it ended up in your data. Check whether you can find something like that in the data; that would already explain a lot.
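To make the byte arithmetic concrete, here is a small sketch in base R. It only shows how one and the same byte, 0x80, turns into different UTF-8 sequences depending on which encoding it is assumed to be in. The encoding names "CP1252" and "ISO-8859-1" are assumed to be known to your platform's iconv (check `iconvlist()` if in doubt).

```r
# A lone byte 0x80, as a Windows-1252 client might have stored it.
b <- rawToChar(as.raw(0x80))

# Read as Windows-1252, the byte is the euro sign (U+20AC).
euro <- iconv(b, from = "CP1252", to = "UTF-8")
charToRaw(euro)   # e2 82 ac

# Read as ISO-8859-1, the same byte is the control character U+0080,
# which is exactly the 0xc280 from the error message.
ctrl <- iconv(b, from = "ISO-8859-1", to = "UTF-8")
charToRaw(ctrl)   # c2 80
```

So a euro sign that was written as Windows-1252 but declared as Latin-1 somewhere along the way would show up as 0xc280 in a UTF8 database.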

My guess is that the solution will not lie at the R end of this workchain, but before it. Automatic conversion will be attempted, but it is reported to fail in some cases (also for SQL and Oracle, by the way). Check which encoding you're working with in PostgreSQL, and try to use one of the Latin encodings. There might be other links involved (PuTTY or a similar terminal, for example). I'm pretty sure all the encodings there are ISO8859-1, which is Latin-1. Somewhere UTF-8 gets thrown in between, and when the 0x80 character gets wrongly mapped to 0xc280, you get trouble.
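If fixing the source is not an option, a possible workaround on the R side is to undo the wrong mapping after the fact. This is only a sketch, and it assumes the strings already reached R as UTF-8 (which is not the case in the question as posted). The helper name `repair_c1` is made up for illustration, and "CP1252" and "ISO-8859-1" are assumed to be supported by your iconv.

```r
# Hypothetical helper: reinterpret wrongly mapped C1 control characters
# (U+0080..U+009F) as the Windows-1252 characters they probably were.
repair_c1 <- function(x) {
  # Step 1: go back to single bytes; U+0080 becomes the byte 0x80.
  # Strings with characters above U+00FF cannot be converted and yield NA.
  latin1_bytes <- iconv(x, from = "UTF-8", to = "ISO-8859-1")
  # Step 2: read those bytes as Windows-1252 instead, then return to UTF-8.
  fixed <- iconv(latin1_bytes, from = "CP1252", to = "UTF-8")
  # Keep the original string wherever either conversion failed.
  ifelse(is.na(fixed), x, fixed)
}

broken <- paste0(intToUtf8(0x80), " 100")  # U+0080 followed by " 100"
repaired <- repair_c1(broken)
charToRaw(repaired)   # e2 82 ac 20 31 30 30
```

Note that this guesses at the original encoding, so it is only safe when you know the data came from a Windows-1252 source.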

So check the encodings in your complete workchain, and make sure that they all match. If they don't, the automatic conversion done between each step is bound to give trouble for some characters.
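As a starting point for that check, something like the following could collect the encodings visible from R. It is a rough sketch: `check_encodings` is a made-up helper, `channel` is assumed to be an open RODBC connection such as the `gp` object from the question, and `SHOW server_encoding` / `SHOW client_encoding` are standard PostgreSQL commands that I assume Greenplum and your ODBC driver pass through unchanged.

```r
# Hypothetical helper: gather the encodings at the R end and the database end.
check_encodings <- function(channel) {
  list(
    # Character type locale of the R session, e.g. English_United States.1252
    r_ctype = Sys.getlocale("LC_CTYPE"),
    # TRUE if R itself is running in a UTF-8 locale
    r_utf8 = l10n_info()[["UTF-8"]],
    # Encoding the database stores its data in
    server = RODBC::sqlQuery(channel, "SHOW server_encoding"),
    # Encoding the server converts to before sending data to this client
    client = RODBC::sqlQuery(channel, "SHOW client_encoding")
  )
}

# Usage (requires a live connection, so it is left commented out):
# gp <- RODBC::odbcConnect("greenplum", believeNRows = FALSE)
# str(check_encodings(gp))
```

Judging by the error message in the question, the server side is UTF8 and the client side is WIN1252, so the conversion between those two is where the 0xc280 character gets rejected.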

Hope this helps.

Joris Meys answered Oct 05 '22 21:10