When using bigrquery from R, the tidyverse API asks for authentication, which, once provided, allows BigQuery to be accessed from the R session. Downloading results can be time consuming, and if the OAuth token expires mid-transfer the download fails:
Error: Invalid Credentials [authError]
# install.packages("tidyverse"); install.packages('bigrquery')
Sys.setenv(BIGQUERY_TEST_PROJECT="your-gcp-project-id")
library(tidyverse)
library(dbplyr)
library(bigrquery)
billing <- bq_test_project()
connection <- dbConnect(
  bigrquery::bigquery(),
  project = "your-gcp-project-id",
  dataset = "dataset-name",
  billing = billing
)
mytable <- tbl(connection, "mytable")
mytable %>%
-- some heavy dplyr wrangling --
%>% collect()
Which returns
Running job [/] 20s
Complete
Billed 400GB
Downloading 230,000,000 rows
Downloading data [=====--------] 28% ETA: 24h
but after some time
Error: Invalid Credentials [authError]
How can the length of time before the OAuth token expires be increased?
I also get the same error, and for me it behaves like a timeout issue.
Another potential workaround is exporting the result to Google Data Studio and then downloading it from there as a CSV.
Or, for large datasets, this is best done by extracting the BigQuery result to Google Cloud Storage and then downloading the data from there, as per this article: https://mran.microsoft.com/web/packages/bigQueryR/vignettes/bigQueryR.html
## Create the data extract from BigQuery to Cloud Storage
job_extract <- bqr_extract_data("your_project",
                                "your_dataset",
                                "bigResultTable",
                                "your_cloud_storage_bucket_name")
## poll the extract job to check its status
## its done when job$status$state == "DONE"
bqr_get_job("your_project", job_extract$jobReference$jobId)
## to download via a URL and not logging in via Google Cloud Storage interface:
## Use an email that is Google account enabled
## Requires scopes:
## https://www.googleapis.com/auth/devstorage.full_control
## https://www.googleapis.com/auth/cloud-platform
## set via options("bigQueryR.scopes") and reauthenticate if needed
download_url <- bqr_grant_extract_access(job_extract, "[email protected]")
## download_url may be multiple if the data is > 1GB
> [1] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000000.csv"
> [2] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000001.csv"
> [3] "https://storage.cloud.google.com/big-query-r-extracts/extract-20160311112410-000000000002.csv"
Not a complete answer, but here are the details of my research so far, to save people from retracing my steps.
dbConnect accepts both the driver and arguments to pass to the driver (documentation): dbConnect(drv, ...).
For some drivers those arguments include a timeout. This issue has an example using Cassandra: con <- dbConnect(odbc::odbc(), "Cassandra (DSN)", timeout = 10).
That argument does not appear to exist for bigquery. The documentation lists the following arguments (project, dataset, billing, page_size, quiet, use_legacy_sql, bigint) and notes that other arguments are currently ignored (see the sketch below).
So given the above, it seems unlikely that the timeout can be controlled by R via dbplyr, DBI, or the connection.
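For reference, here is a sketch of a connection call spelling out the documented arguments (the values are placeholders, and the defaults shown are my reading of the documentation); note the absence of any timeout-style argument.
## Sketch of a bigrquery connection using only the documented arguments;
## anything else passed through ... is currently ignored.
connection <- dbConnect(
  bigrquery::bigquery(),
  project        = "your-gcp-project-id",  # project containing the dataset
  dataset        = "dataset-name",         # default dataset for unqualified tables
  billing        = billing,                # project billed for queries
  page_size      = 10000,                  # rows per page when downloading results
  quiet          = NA,                     # NA = only report progress for long jobs
  use_legacy_sql = FALSE,                  # use standard SQL
  bigint         = "integer64"             # how 64-bit integers are returned
)
## There is no timeout argument, unlike e.g. dbConnect(odbc::odbc(), ..., timeout = 10)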
While not the OP's preference (the comments make that clear), this is still a potential solution. I use an approach of filtering on a unique ID column, with a wrapper function to reduce some of the additional clutter:
reconnect <- function(jj){
  if(exists("connection"))
    dbDisconnect(connection) # avoids multiple concurrent connections
  print(paste0(as.character(Sys.time()), " -- modulo ", jj, " begun")) # track progress
  connection <- dbConnect(
    bigrquery::bigquery(),
    project = "your-gcp-project-id",
    dataset = "dataset-name",
    billing = billing
  )
  mytable <- tbl(connection, "mytable") %>%
    filter(unique_id %% NUM_SUBSETS == jj) # filter to this subset; requires a unique_id column
  # assignment into the parent environment
  assign("connection", connection, envir = parent.frame())
  assign("mytable", mytable, envir = parent.frame())
}
We then iterate as follows:
## parameters
DEVELOPMENT_MODE = FALSE
NUM_SUBSETS = 50

## subset
modulo = if(DEVELOPMENT_MODE){
  0                     # only one subset in development mode
} else {
  0:(NUM_SUBSETS - 1)   # otherwise all of them
}

results = data.frame()
for(jj in modulo){
  reconnect(jj)
  these_results = mytable %>%
    -- some heavy dplyr wrangling --
    %>% collect()
  results = rbind(results, these_results)
}
I set DEVELOPMENT_MODE to TRUE when testing/developing, and to FALSE when I want the entire thing to run.
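A small variation on the loop above (my own sketch, not part of the original approach): writing each chunk to disk as it arrives means a failure part-way through only costs the current chunk, not everything downloaded so far. The chunk_ file names are arbitrary placeholders.
## Variation: persist each chunk so a mid-run failure is cheap to recover from
for(jj in modulo){
  reconnect(jj)
  these_results <- mytable %>%
    # -- some heavy dplyr wrangling --
    collect()
  saveRDS(these_results, sprintf("chunk_%03d.rds", jj))  # one file per subset
}

## stitch the chunks back together afterwards
results <- dplyr::bind_rows(
  lapply(list.files(pattern = "^chunk_\\d+\\.rds$"), readRDS)
)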
It is also worth considering how demanding the -- heavy dplyr wrangling here -- step is. Because dbplyr does not always translate into very efficient SQL code, in my work on SQL Server saving intermediate tables has cut hours off my runtimes. Given that downloading 10GB should be much faster than several hours, the bottleneck could be BigQuery conducting all the wrangling on the fly (and the initial 20-second execution only reflecting lazy evaluation). This link suggests there is a six-hour limit on the duration of a single execution.
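One way to act on this (a sketch on my part, not something I have benchmarked against this exact query) is to materialize the wrangled result as a table in BigQuery with dplyr::compute() before collecting, so the heavy SQL runs once server-side and the download only streams the finished table. The table name below is a placeholder, and with temporary = FALSE you will want to clean the table up afterwards.
## Sketch: materialize the heavy wrangling server-side, then download the result
wrangled <- mytable %>%
  # -- some heavy dplyr wrangling --
  compute(name = "mytable_wrangled",  # placeholder table name in the dataset
          temporary = FALSE)          # BigQuery may not support temporary tables here

results <- wrangled %>% collect()

## remove the intermediate table when done
DBI::dbRemoveTable(connection, "mytable_wrangled")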