Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

RODBC: chars and numerics converted aggressively (with/without as.is)

Tags:

r

rodbc

Related to https://stackoverflow.com/a/33284035/3358272, I'm finding inconsistent behavior with pulling data from SQL Server (2014).

library(RODBC)
sqlQuery(.conn, "CREATE TABLE r2test ( [mychar] [NVARCHAR](16), [mynum] [FLOAT])")
# character(0)
sqlQuery(.conn, "INSERT INTO r2test (mychar,mynum) VALUES ('1',3.141593),('2',6.283185)")
character(0)
str(sqlQuery(.conn, "SELECT * FROM r2test", stringsAsFactors = FALSE))
# 'data.frame': 2 obs. of  2 variables:
#  $ mychar: int  1 2
#  $ mynum : num  3.14 6.28

In that example we see the undesired behavior: the characters of mychar are being internally converted to integers. Per the previously-mentioned SO answer, the as.is option defeats this, but has the unfortunate side-effect of also forcing decimal-represented floats to strings:

str(sqlQuery(.conn, "SELECT * FROM r2test", stringsAsFactors = FALSE, as.is = TRUE))
# 'data.frame': 2 obs. of  2 variables:
#  $ mychar: chr  "1" "2"
#  $ mynum : chr  "3.1415929999999999" "6.2831849999999996"

If at least one of mychar is actually not integer-izable, things are fine:

sqlQuery(.conn, "INSERT INTO r2test (mychar,mynum) VALUES ('a',9.424778)")
# character(0)
str(sqlQuery(.conn, "SELECT * FROM r2test", stringsAsFactors = FALSE))
# 'data.frame': 3 obs. of  2 variables:
#  $ mychar: chr  "1" "2" "a"
#  $ mynum : num  3.14 6.28 9.42

Unfortunately, the data model does not support arbitrarily adding something to encourage this behavior (or I just haven't thought of a good-enough way of doing it). The data model is such that values of mychar include 01 and 1 which are character-wise distinct. The only workaround I've found is to use as.is = TRUE, which will require me to as.numeric all related columns, something that is both tedious and (theoretically) unnecessary work.

Since the docs suggest needing to set DBMSencoding, I checked the current encoding (helped by https://stackoverflow.com/a/5182469/3358272):

sqlQuery(.conn, "SELECT SERVERPROPERTY('Collation')")
# 1 SQL_Latin1_General_CP1_CI_AS

I've tried using (for kicks): DBMSencoding="latin1", DBMSencoding="UTF-8", and explicitly though the default DBMSencoding="" with no change in behavior.

How can I encourage the behavior of not over-coercing the data types?

Currently using R-3.2.5 and RODBC-1.3.13 on ubuntu.

like image 456
r2evans Avatar asked Jul 18 '16 17:07

r2evans


1 Answers

If I understand correctly, I think this if what you are looking for,

str(sqlQuery(
  .conn, 
  "SELECT * FROM r2test", 
  stringsAsFactors = FALSE,
  as.is = c(TRUE, FALSE)
))
#'data.frame':  2 obs. of  2 variables:
# $ mychar: chr  "1" "2"
# $ mynum : num  3.14 6.28

where as.is is specified as a logical vector (must have the same length as the number of columns in the result set). To be fair, this isn't really spelled out very well. The man page for sqlQuery just refers you to the as.is argument in read.table, which states:

Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped.

The downside of this approach is that you need to know in advance which columns you want to convert and which you don't. Personally I don't see why the default behavior isn't to just map SQL character types to R character types, SQL numeric types to R numeric types, etc, but perhaps there is a good reason for this on the backend. Automatically converting '1', '2', ... to integers does not seem like much of a "feature" to me.

like image 53
nrussell Avatar answered Sep 23 '22 05:09

nrussell