Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

RODBC and Microsoft SQL Server: Truncating Long Character Strings

I am trying to query a variable from a Microsoft SQL Server database using R/RODBC. RODBC is truncating the character string at 8000 characters.

Original code: truncates at 255 characters (as per RODBC documentation) library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT text_var FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)

Partial solution: modifying query string truncate text after 7999 characters. library(RODBC) con_string <- odbcConnect("DSN") query_string <- "SELECT [text_var]=CAST(text_var AS VARCHAR(8000)) FROM table_name" dat <- sqlQuery(con_string, query_string, stringsAsFactors=FALSE)

The table/variable contains text strings at long as 250,000 characters. I really want to work with all the text in R. Is this possible?

@BrianRipley discusses the problem (but no solution) on page 18 of following document: https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

@nutterb dicusses similar issues with RODBCext package on GitHub: https://github.com/zozlak/RODBCext/issues/6

Have seen similar discussion on SO, but no solution using RODBC with VARCHAR>8000.

RODBC sqlQuery() returns varchar(255) when it should return varchar(MAX)

RODBC string getting truncated

Note:

  • R 3.3.2
  • Microsoft SQL Server 2012
  • Linux RHEL 7.1
  • Microsoft ODBC Driver for SQL Server
like image 953
Chris Avatar asked May 29 '17 20:05

Chris


People also ask

How do I trim a string to a specific length in SQL?

SQL Server TRIM() Function The TRIM() function removes the space character OR other specified characters from the start or end of a string. By default, the TRIM() function removes leading and trailing spaces from a string.

Is there a LEN function in SQL?

SQL Server LEN() FunctionThe LEN() function returns the length of a string. Note: Trailing spaces at the end of the string is not included when calculating the length. However, leading spaces at the start of the string is included when calculating the length.

What is the length of text data type in SQL Server?

Variable-length non-Unicode data in the code page of the server and with a maximum string length of 2^31-1 (2,147,483,647). When the server code page uses double-byte characters, the storage is still 2,147,483,647 bytes. Depending on the character string, the storage size may be less than 2,147,483,647 bytes.


2 Answers

Since this is a limitation of the ODBC driver provided by Microsoft, there's little to be done until they make a change to the driver. @zozlak explains why in the GitHub issue you linked to.

I tend to use stored procedures to get around this when I need to, but that usually requires writing a stored procedure for each particular instance. At some point, I may come up with a way to do this in a stored procedure more generically, but I find the process of constructing queries in a stored procedure to be tedious and frustrating.

For that reason, I just spent some time building a function that will do limited queries that involve VARCHAR(MAX) variables. It's a brute force approach that for a variable of 17000 characters will export it in three variables and paste them together in R. It's crude, probably not very efficient, but the best solution I've come up with yet.

The other limitation is that it doesn't allow you to rename variables within the query. You'll be stuck with the variables as they are named in the database. If you're only involving a couple of tables, that may not be a problem. In very complex database, this can be problematic. But, at least with this you could query just the VARCHAR(MAX) variables with a handful of necessary IDs on which to perform a merge in R.

As discussed in the GitHub issue, it's probably best to avoid VARCHAR(MAX) as much as possible. If an unknown length is truly necessary, VARBINARY(MAX) is a little easier to query.

Example

source("https://gist.githubusercontent.com/nutterb/d2e050dada608bb6213e61d0f8471b65/raw/be8717f318b3e3087e7c26c9a8f9d0a582a5daef/query_varchar_max"

channel <- odbcDriverConnect(...)

query_varchar_max(channel = channel,
                  id = c("idvar"),
                  varchar_max = c("varchar_max_var", "varchar_max_var2"),
                  from = "FROM dbo.table_name WHERE group = ?",
                  data = list(group = "A"))

Function Code

#' @name query_varchar_max
#' @title Query a VARCHAR(MAX) Variable from SQL Server
#' 
#' @description The RODBC driver to SQL Server (SQL Server Native Client 11.0)
#'   reports the lenght of a VARCHAR(MAX) variable to be zero.  This presents 
#'   difficulties in extracting long text values from the database. Often, the
#'   ODBC will assume a length of 255 characters and truncate the text to that
#'   many characters.  The approach taken here searches the VARCHAR(MAX) variables 
#'   for the longest length, and extracts the data in segments to be pasted 
#'   together in R.  
#'   
#' @param channel A valid ODBC channel to a SQL Server database.
#' @param id A character vector of ID variables that may be used to merge the 
#'   data from this query into another dataset.
#' @param varchar_max a character vector of variable names that are to be 
#'   treated as if they are VARCHAR(MAX) variables. 
#' @param from A single character string providing the remainder of the query 
#'   to be run, beginning with the \code{FROM} statement.
#' @param stringsAsFactors \code{logical(1)}. Should character strings returned 
#'   from the database be converted to factors?
#' @param ... Additional arguments to \code{sqlExecute} when running the full 
#'   query.
#'   
#' @details \code{query_varchar_max} operates by determining how many columns of up to
#'   8000 characters each are required to export a complete VARCHAR(MAX) variable.
#'   It then creates the necessary number of intermediate variables and queries the 
#'   data using the SQL Server \code{SUBSTRING} command, extracting the VARCHAR(MAX)
#'   variable in increments of 8000 characters. After completing the query, 
#'   the intemediary variables are concatenated and removed from the data.
#'   
#'   The function makes accommodation for multi-part queries as far as [TABLE].[VARIABLE]
#'   formats are concerned. It is not intended for use in [SCHEMA].[TABLE].[VARIABLE]
#'   formats. This at least allows \code{from} to include joins for more complex 
#'   queries.  Parameterized queries are also supported through \code{sqlExecute}.
#'
#' @export

query_varchar_max <- function(channel, id, varchar_max, from, 
                              stringsAsFactors = FALSE, ...)
{
  coll <- checkmate::makeAssertCollection()

  checkmate::assert_class(x = channel,
                          classes = "RODBC",
                          add = coll)

  checkmate::assert_character(x = id,
                              add = coll)

  checkmate::assert_character(x = varchar_max,
                              add = coll)

  checkmate::assert_character(x = from,
                              len = 1,
                              add = coll)

  checkmate::assert_logical(x = stringsAsFactors,
                            len = 1,
                            add = coll)

  checkmate::reportAssertions(coll)

  varchar_max_len <-
    paste0(
      sprintf("MAX(LEN(%s)) AS len_%s", 
              varchar_max,
              sub("[.]", "_", varchar_max)),
      collapse = ", "
    )

  varchar_len <- 
    unlist(
      RODBCext::sqlExecute(
        channel = channel,
        query = sprintf("SELECT %s %s",
                        varchar_max_len,
                        from),
        fetch = TRUE
      )
    )

  varchar_max_cols <- 
    unlist(
      mapply(expand_varchar_max,
             varchar_max,
             varchar_len,
             SIMPLIFY = FALSE)
    )

  Prelim <- 
    RODBCext::sqlExecute(
      channel = channel,
      query = sprintf("SELECT %s, %s %s",
                      paste0(id, collapse = ", "), 
                      paste0(varchar_max_cols, collapse = ", "),
                      from),
      fetch = TRUE,
      stringsAsFactors = stringsAsFactors,
      ...
    )

  var_stub_to_combine <-
    unique(
      sub(
        "(part)(\\d{1,3})", 
        "\\1",
        sub(".+AS ", "", varchar_max_cols)
      )
    )

  col_to_combine <- 
    lapply(var_stub_to_combine,
           grep,
           names(Prelim))

  Prelim[sub(".+[.]", "", varchar_max)] <-
    lapply(col_to_combine,
           function(col) apply(Prelim[col], 1, paste0, collapse = ""))

  Prelim[-unlist(col_to_combine)]

}


expand_varchar_max <- function(varchar_max, varchar_len)
{
  nvar <- varchar_len %/% 8000 + 1

  var_list <- vector("character", length = nvar)

  for (i in seq_along(var_list))
  {
    var_list[i] <- 
      sprintf("SUBSTRING(%s, %s, %s) AS %s_part%s",
              varchar_max,
              1 + (i - 1) * 8000,
              8000,
              paste0(sub("[.]", "_", varchar_max)),
              i)
  }
  var_list
}
like image 58
Benjamin Avatar answered Oct 26 '22 22:10

Benjamin


Maybe this has changed recently, but with the following query format getting varchar(MAX) fields longer then 8000 character can be queried from SQL Server into R

query_string <- "SELECT CAST(text_var AS text) AS text_var FROM table_name"
like image 28
András Avatar answered Oct 26 '22 21:10

András