Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a way to turn off DESCRIBE in R dplyr sql

Tags:

sql

r

dplyr

impala

I'm using R shiny and dplyr to connect to a database and query the data in Impala. I do the following.

con <- dbPool(odbc(),
Driver = [DIVER],
Host = [HOST],
Schema = [SCHEMA],
Port = [PORT],
UID = [USERNAME],
PWD = [PASSWORD])

table_foo <- tbl(con, [TABLE_NAME_FOO])
table_bar <- tbl(con, [TABLE_NAME_BAR])

When I run a query for instance:

table %>% filter(name=greg) %>% collect()

There are describes that are being run on the impala:

DESCRIBE TABLE 'table_foo'
DESCRIBE TABLE 'table_bar'

All the describes run before every query.(Every collect()) I have many tables all these describes are wasting substantial amount of time. Especially on impala where some describes can take a while to run.

Is there a way to turn these off? Is there anything I can do? I looked at the docs and could not find anything: https://db.rstudio.com/dplyr/

like image 561
bink1time Avatar asked Aug 12 '19 19:08

bink1time


2 Answers

My hypothesis is that this is a result of using the pool package and the way it manages multiple connections to a database. However your question does not give enough details for me to be certain, and I can not recreate your situation in my local environment.

If the problem is a result of the pool package, my suggestion would be to switch to using the dbplyr package directly. This provides a translation of dplyr commands into the database language.

# packages
library(DBI)
library(dplyr)
library(dbplyr)

# Connection string (see note below)
connection_string = "DRIVER = ...; DATABASE = ...; HOST = ..."
# connection
con = dbConnect(odbc::odbc(), .connection_string = connection_string)

# connect to two tables
table_foo <- tbl(con, "TABLE_NAME_FOO")
table_bar <- tbl(con, "TABLE_NAME_BAR")

# review queries
table_foo %>% explain()
table_foo %>% filter(name=greg) %>% explain()
# fetch data
table_foo %>% filter(name=greg) %>% collect()

This link appears to provide instruction as to how to create an Impala connection string. But given your setup, you might have an equivalent way.

If you try the above, and still observe DESCRIBE TABLE being run, then it could be that describing tables is required for returning results as @Lukasz suggests - and the nature of an Impala connection is that it can not tell which tables are needed.

You also mention in your opening line that you are using R Shiny. The dynamic nature of shiny apps means it is not always intuitive which parts of the code run. So if you have not tested the above in isolation, it is possible that DESCRIBE TABLE is being called by a different part of your shiny app that you did not realize was being run.

like image 83
Simon.S.A. Avatar answered Oct 19 '22 23:10

Simon.S.A.


Maybe something like this ? I can't reproduce your issue so sorry if it doesn't help :

lazy_tbl <- table %>% filter(name=greg)
sql_query <- dbplyr::sql_render(lazy_tbl) %>% gsub("\\n"," ", .)
# print sql_query and see if there are DESCRIBE statements to get out
output <- DBI::dbGetQuery(con, sql_query)

Then once you get it right wrap it up in a custom collect function :

collect2 <- function(x){
  dbplyr::sql_render(x) %>% 
    gsub("\\n"," ", .) %>%
    # other steps ?
    DBI::dbGetQuery(x$src$con, .)
}
like image 38
Moody_Mudskipper Avatar answered Oct 19 '22 23:10

Moody_Mudskipper