Increase time before tidyverse API OAuth token expires when using bigquery?

When using bigrquery from R, the tidyverse API asks for authentication, which, once provided, allows BigQuery to be accessed from the R session. Downloading results can be time-consuming, and if the OAuth token expires mid-transfer, the download fails:

Error: Invalid Credentials [authError]

Example code

# install.packages("tidyverse"); install.packages('bigrquery')
Sys.setenv(BIGQUERY_TEST_PROJECT="your-gcp-project-id") 
library(tidyverse)
library(dbplyr)
library(bigrquery)

billing <- bq_test_project()

connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",
  dataset = "dataset-name",
  billing = billing
)

mytable <- tbl(connection, "mytable")

mytable %>%
  -- some heavy dplyr wrangling --
  %>% collect()

Which returns

Running job [/] 20s
Complete
Billed 400GB
Downloading 230,000,000 rows
Downloading data [=====--------] 28% ETA: 24h

but after some time

Error: Invalid Credentials [authError]

Question

How can the length of time before the OAuth token expires be increased?

asked May 08 '20 by stevec

2 Answers

I also ran into the same error; it appears to be a timeout issue.

One potential workaround is to export the results to Google Data Studio and then download them as a CSV.

Or, for large datasets, this is best done by extracting the BigQuery result to Google Cloud Storage and then downloading the data from there, as per this article: https://mran.microsoft.com/web/packages/bigQueryR/vignettes/bigQueryR.html

    ## Create the data extract from BigQuery to Cloud Storage
    job_extract <- bqr_extract_data("your_project",
                                    "your_dataset",
                                    "bigResultTable",
                                    "your_cloud_storage_bucket_name")

    ## poll the extract job to check its status
    ## it's done when job$status$state == "DONE"
    bqr_get_job("your_project", job_extract$jobReference$jobId)

    ## to download via a URL without logging in through the Google Cloud Storage interface:
    ## Use an email that is Google account enabled
    ## Requires scopes:
    ##  https://www.googleapis.com/auth/devstorage.full_control
    ##  https://www.googleapis.com/auth/cloud-platform
    ## set via options("bigQueryR.scopes") and reauthenticate if needed

    download_url <- bqr_grant_extract_access(job_extract, "[email protected]")

    ## download_url may be multiple if the data is > 1GB
    > [1] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000000.csv"
    > [2] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000001.csv"
    > [3] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000002.csv"
answered Oct 03 '22 by Hao Zhang


Not a complete answer, but here are the details of my research so far, to save others from retracing my steps.

Timeout does not appear controllable via dbplyr

  • dbConnect accepts both the driver and arguments to pass to the driver (documentation): dbConnect(drv, ...).
  • For some connection types the additional arguments can include timeout. This issue has an example using Cassandra: con <- dbConnect(odbc::odbc(), "Cassandra (DSN)", timeout = 10).
  • However, timeout does not appear to be supported as an argument for bigquery. The documentation lists the following arguments (project, dataset, billing, page_size, quiet, use_legacy_sql, bigint) and notes that other arguments are currently ignored.

So given the above, it seems unlikely that the timeout can be controlled by R via dbplyr, DBI or the connection.
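
For completeness, a minimal sketch of what one might be tempted to try; the timeout argument here is borrowed from the Cassandra example above and, per the bigrquery documentation, is simply ignored:

connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",
  dataset = "dataset-name",
  billing = billing,
  timeout = 60 * 60 * 24  # runs, but has no effect: not among the documented arguments
)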

Splitting the query into multiple shorter queries

While not OP's preference (the comments make that clear), this is still a potential solution. I use an approach of filtering on a unique ID column, with a wrapper function to reduce some of the additional clutter:

reconnect <- function(jj){
  if(exists("connection"))
    dbDisconnect(connection) # avoids multiple concurrent connections

  print(paste0(as.character(Sys.time()),"-- modulo ",jj," begun")) # track progress

  connection <- dbConnect(
    bigrquery::bigquery(),
    project = "your-gcp-project-id",
    dataset = "dataset-name",
    billing = billing
  )

  mytable <- tbl(connection, "mytable") %>%
    filter(unique_id %% NUM_SUBSETS == jj) # filter to subset, requires unique_id

  # assignment into the parent environment
  assign("connection", connection, envir = parent.frame())
  assign("mytable ", mytable , envir = parent.frame())
}

We then iterate as follows:

## parameters
DEVELOPMENT_MODE = FALSE
NUM_SUBSETS = 50

## subset
modulo = if(DEVELOPMENT_MODE){
  0                   # only one subset in development mode
} else {
  0:(NUM_SUBSETS-1)   # otherwise all of them
}

results = data.frame()

for(jj in modulo){
  reconnect(jj)

  these_results = mytable %>%
    -- some heavy dplyr wrangling --
    %>% collect()

  results = rbind(results, these_results)
}

I set DEVELOPMENT_MODE to TRUE when testing/developing, and to FALSE when I want the entire thing to run.
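
Since the whole point of splitting is to survive interruptions, it may also be worth saving each subset to disk as it completes, so a dropped connection only costs the current subset rather than the whole run. A sketch using base R (the file names are illustrative):

for(jj in modulo){
  subset_file = paste0("subset_", jj, ".rds")
  if(file.exists(subset_file)) next # already fetched on an earlier run

  reconnect(jj)

  these_results = mytable %>%
    -- some heavy dplyr wrangling --
    %>% collect()

  saveRDS(these_results, subset_file)
}

## once every subset is on disk, combine them
results = do.call(rbind, lapply(paste0("subset_", modulo, ".rds"), readRDS))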

Other avenues to consider

  • Check whether the timeout can be set or controlled within the BigQuery account (if it cannot be controlled via R).
  • Investigate how complex the -- some heavy dplyr wrangling -- step is. Because dbplyr does not translate into very efficient SQL code, in my work on SQL Server saving intermediate tables has cut hours off my runtimes. Given that downloading 10GB should be much faster than several hours, the bottleneck could be BigQuery conducting all the wrangling on the fly (with the initial 20-second execution reflecting lazy evaluation). This link suggests there is a six-hour limit on the duration of a single execution. A sketch of the intermediate-table idea follows this list.
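
To illustrate the second point, dbplyr's compute() can materialise an intermediate result as a table on the BigQuery side before the final collect(), so only the last, smaller step is run and downloaded. A rough sketch, reusing the connection from above; the table name is illustrative, and depending on your bigrquery version you may need temporary = FALSE and to drop the table afterwards:

intermediate <- mytable %>%
  -- first chunk of heavy dplyr wrangling --
  %>% compute(name = "my_intermediate_table", temporary = FALSE)

results <- intermediate %>%
  -- remaining wrangling --
  %>% collect()
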
answered Oct 03 '22 by Simon.S.A.