Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF-8 encoding with dplyr and SQLite

I have a table in SQLite and I’d like to open it with dplyr. I use SQLite Expert Version 35.58.2478, R Studio Version 0.98.1062 on a PC with Win 7.

After connecting to the database with src_sqlite() and reading with tbl() I get the table. But the character enconding is wrong. Reading the same table from a csv-file just works by adding encoding = “utf-8” to the function read.csv but in this case another error in the first column name occurs (please consider the minimal example below).

Note that in the SQLite table the encoding is UTF-8 and SQLite displays the data correctly.

I tried to change the encoding in R Studio options with no success. Also changing the region in windows or in r doesn’t have any effect.

Is there any solution of getting the characters in the table correctly into r using dplyr?

Minimal Example

library(dplyr)
db <- src_sqlite("C:/Users/Jens/Documents/SQLite/my_db.sqlite")
tbl(db, "prozesse")
## Source: sqlite 3.7.17 [C:/Users/Jens/Documents/SQLite/my_db.sqlite]
## From: prozesse [4 x 4]
## 
##   KH_ID Einschätzung Prozess Gruppe
## 1     1             3 Buchung     IT
## 2     2             4 Buchung     IT
## 3     3             3 Buchung    OLP
## 4     4             5 Buchung    OLP

You see the wrong encoding in the name of the second column. This issue occures as well in the colums with ä, ö, ü etc.

The name of the second column is displayed correctly, but the first column is wrong:

read.csv("C:/Users/Jens/Documents/SQLite/prozess.csv", encoding = "UTF-8")
##   X.U.FEFF.KH_ID Einschätzung Gruppe Prozess
## 1              1            3     PO  visite
## 2              2            3     IT  visite
## 3              3            3     IT  visite
## 4              2            3     PO  visite


sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] RSQLite.extfuns_0.0.1 RSQLite_0.11.4        DBI_0.3.0            
## [4] dplyr_0.2            
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1  digest_0.6.4    evaluate_0.5.5  formatR_1.0    
##  [5] htmltools_0.2.6 knitr_1.6       parallel_3.1.1  Rcpp_0.11.2    
##  [9] rmarkdown_0.3.3 stringr_0.6.2   tools_3.1.1     yaml_2.1.13
like image 451
jnshsrs Avatar asked Sep 19 '14 13:09

jnshsrs


1 Answers

I had the same problem. I solved it like below. However, I do not guarantee that the solution is rock solid. Give it a try:

library(dplyr)
library(sqldf)

# Modifying built-in mtcars dataset

mtcars$test <- 
  c("č", "ž", "š", "č", "ž", "š", letters) %>% 
  enc2utf8(.)

mtcars$češćžä <- 
  c("č", "ž", "š", "č", "ž", "š", letters) %>% 
  enc2utf8(.)

names(mtcars) <- 
  iconv(names(mtcars), "cp1250", "utf-8")

# Connecting to sqlite database

my_db <- src_sqlite("my_db.sqlite3", create = T)

# exporting mtcars dataset to database
copy_to(my_db, mtcars, temporary = FALSE)

# dbSendQuery(my_db$con, "drop table mtcars")

# getting data from sqlite database
my_mtcars_from_db <-
  collect(tbl(my_db, "mtcars"))

# disconnecting from database
dbDisconnect(my_db$con)

convert_to_encoding() function

# a function that encodes 
# column names and values in character columns
# with specified encodings
convert_to_encoding <- 
  function(x, from_encoding = "UTF-8", to_encoding = "cp1250"){

    # names of columns are encoded in specified encoding
    my_names <- 
      iconv(names(x), from_encoding, to_encoding) 

    # if any column name is NA, leave the names
    # otherwise replace them with new names
    if(any(is.na(my_names))){
      names(x)
    } else {
      names(x) <- my_names
    }

    # get column classes
    x_char_columns <- sapply(x, class)
    # identify character columns
    x_cols <- names(x_char_columns[x_char_columns == "character"])

    # convert all string values in character columns to 
    # specified encoding
    x <- 
      x %>%
      mutate_each_(funs(iconv(., from_encoding, to_encoding)), 
                   x_cols)
    # return x
    return(x)
  }

# use
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")

Results

# before conversion
my_mtcars_from_db

Source: local data frame [32 x 13]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb češćžä test
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4          ÄŤ   ÄŤ
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4          Ĺľ   Ĺľ
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1          š   š
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1          ÄŤ   ÄŤ
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2          Ĺľ   Ĺľ
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1          š   š
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4           a    a
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2           b    b
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2           c    c
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4           d    d
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...         ...  ...

# after conversion
convert_to_encoding(my_mtcars_from_db, "UTF-8", "cp1250")

Source: local data frame [32 x 13]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb test češćžä
1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4    č      č
2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4    ž      ž
3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    š      š
4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    č      č
5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    ž      ž
6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    š      š
7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    a      a
8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    b      b
9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    c      c
10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4    d      d
..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...  ...    ...

Session information

devtools::session_info()

Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, mingw32             
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  Slovenian_Slovenia.1250     
 tz       Europe/Prague               

Packages -----------------------------------------------------------------------
 package    * version date       source        
 assertthat * 0.1     2013-12-06 CRAN (R 3.2.0)
 chron      * 2.3-45  2014-02-11 CRAN (R 3.2.0)
 DBI          0.3.1   2014-09-24 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 dplyr        0.4.1   2015-01-14 CRAN (R 3.2.0)
 gsubfn       0.6-6   2014-08-27 CRAN (R 3.2.0)
 lazyeval   * 0.1.10  2015-01-02 CRAN (R 3.2.0)
 magrittr   * 1.5     2014-11-22 CRAN (R 3.2.0)
 proto        0.3-10  2012-12-22 CRAN (R 3.2.0)
 R6         * 2.0.1   2014-10-29 CRAN (R 3.2.0)
 Rcpp       * 0.11.6  2015-05-01 CRAN (R 3.2.0)
 RSQLite      1.0.0   2014-10-25 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 sqldf        0.4-10  2014-11-07 CRAN (R 3.2.0)
like image 51
Miha Trošt Avatar answered Nov 16 '22 21:11

Miha Trošt