Get column types of an SQL table with dplyr

Tags: sqlite, r, dplyr

Is there a dplyr (or other package) command for getting the column (field?) types of an SQL table? For example...

library(RSQLite)
library(dplyr)

data(iris)

dat_sql <- src_sqlite("test.sqlite", create = TRUE)
copy_to(dat_sql, iris, name = "iris_df")

iris_tbl <- tbl(dat_sql, "iris_df")
iris_tbl
# Source:   query [?? x 5]
# Database: sqlite 3.8.6 [test.sqlite]
# 
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
# 1           5.1         3.5          1.4         0.2  setosa
# 2           4.9         3.0          1.4         0.2  setosa
# 3           4.7         3.2          1.3         0.2  setosa
# 4           4.6         3.1          1.5         0.2  setosa
# 5           5.0         3.6          1.4         0.2  setosa
# 6           5.4         3.9          1.7         0.4  setosa
# 7           4.6         3.4          1.4         0.3  setosa
# 8           5.0         3.4          1.5         0.2  setosa
# 9           4.4         2.9          1.4         0.2  setosa
# 10          4.9         3.1          1.5         0.1  setosa
# # ... with more rows

I'm interested in a command that would tell me that the first four columns are of type dbl and the last is a chr (or better yet, the R types numeric and character) without actually collecting the data in memory. Since it is printed, there has to be a way to do this, right? I tried str to no avail:

str(iris_tbl)
# List of 2
#  $ src:List of 2
#   ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
#   .. .. ..@ Id                 :<externalptr> 
#   .. .. ..@ dbname             : chr "test.sqlite"
#   .. .. ..@ loadable.extensions: logi TRUE
#   .. .. ..@ flags              : int 6
#   .. .. ..@ vfs                : chr ""
#   ..$ path: chr "test.sqlite"
#   ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
#  $ ops:List of 3
#   ..$ src :List of 2
#   .. ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
#   .. .. .. ..@ Id                 :<externalptr> 
#   .. .. .. ..@ dbname             : chr "test.sqlite"
#   .. .. .. ..@ loadable.extensions: logi TRUE
#   .. .. .. ..@ flags              : int 6
#   .. .. .. ..@ vfs                : chr ""
#   .. ..$ path: chr "test.sqlite"
#   .. ..- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
#   ..$ x   :Classes 'ident', 'sql', 'character'  chr "iris_df"
#   ..$ vars: chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
#  - attr(*, "class")= chr [1:4] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
# NULL
asked Sep 06 '16 20:09 by Alexey Shiklomanov


2 Answers

When it prints a preview of a remote table, dplyr appears to collect the first few rows of the table. Since dplyr already retrieves a small sample of the data, you can do the same.

Here, we make a query for the first few rows with head, collect the query results, and inspect the class of each column.

iris_tbl %>% 
  head %>% 
  collect %>% 
  lapply(class) %>% 
  unlist
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#>    "numeric"    "numeric"    "numeric"    "numeric"  "character" 

(When used with a data-frame, lapply does column-wise function application, so it applies class to each column.)
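To make the column-wise behaviour concrete, here is a minimal base R sketch. The data frame `df` and its columns are made up for illustration and have nothing to do with the SQL table.

```r
# Hypothetical two-column data frame, used only to illustrate lapply().
df <- data.frame(x = c(1.5, 2), y = c("a", "b"), stringsAsFactors = FALSE)

# A data frame is a list of columns, so lapply() calls class() once per column.
col_classes <- unlist(lapply(df, class))
col_classes
#>           x           y 
#>   "numeric" "character" 
```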

To get the type names that dplyr uses, use type_sum.

iris_tbl %>% head %>% collect %>% lapply(type_sum) %>% unlist
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#>        "dbl"        "dbl"        "dbl"        "dbl"        "chr" 
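If the declared SQL types are enough and you would rather not fetch any rows, SQLite's own catalog can be queried through DBI. The following is only a sketch under some assumptions: it uses a fresh in-memory database and dbWriteTable instead of the test.sqlite file from the question, and it reports SQLite's declared column types (typically names such as REAL or TEXT), not R classes.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for test.sqlite.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris_df", iris)

# PRAGMA table_info returns one row per column, including its declared type.
# It reads the table definition only, not the table's rows.
info <- dbGetQuery(con, "PRAGMA table_info(iris_df)")
sql_types <- setNames(info$type, info$name)
sql_types

dbDisconnect(con)
```

Because SQLite is loosely typed, the declared type is a hint rather than a guarantee about every stored value, so sampling rows as above remains the more direct way to see what R will receive.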
answered Sep 22 '22 00:09 by TJ Mahr


Have a look at glimpse(). From its documentation:

This is like a transposed version of print: columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)

Applied to the table from the question, it gives:

> glimpse(iris_tbl)
#Observations: NA
#Variables: 5
#$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0,...
#$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4,...
#$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5,...
#$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2,...
#$ Species      <chr> "setosa", "setosa", "setosa", "setosa",...

If you want the types as a named vector instead, you could do:

vapply(as.data.frame(head(iris_tbl)), typeof, character(1))

Which gives:

#Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "double"     "double"     "double"     "double"  "character" 
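The same idea can be wrapped in a small helper. This is a sketch, not something dplyr provides: `remote_col_types` is a hypothetical name, the in-memory database is an assumption standing in for test.sqlite, and the dbplyr package needs to be installed for tbl() and copy_to() to work on a DBI connection.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): pull n rows from a remote table
# and report the first class of each column as a named character vector.
remote_col_types <- function(remote_tbl, n = 1) {
  sample_rows <- collect(head(remote_tbl, n))
  # class() can return several values (e.g. for date-times), so keep the first
  # one to satisfy vapply()'s character(1) template.
  vapply(sample_rows, function(col) class(col)[1], character(1))
}

# Assumption: an in-memory SQLite database stands in for test.sqlite.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, iris, name = "iris_df")

types <- remote_col_types(tbl(con, "iris_df"))
types

DBI::dbDisconnect(con)
```

Only n rows ever travel from the database to R, so the cost stays small even for large tables. The result reflects what the driver returns for those sampled rows.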
answered Sep 22 '22 00:09 by Steven Beaupré