How to select from multiple tables within a dataset in BigQuery using dplyr and bigrquery?

I'm trying to query multiple tables from a dataset in BigQuery using dplyr and bigrquery. The dataset holds one table for each day of data in a year. I can query a single table (e.g., one day of data) with the following code, but I can't make it work across multiple tables at once (e.g., for a month or a year of data). Any help would be greatly appreciated.

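library(dplyr)
library(bigrquery)

# Connect to the dataset, then count rows per value of field1 for one day
connection <- src_bigquery("my_project", "dataset1")
first_day <- connection %>%
    tbl("20150101") %>%
    select(field1) %>%
    group_by(field1) %>%
    summarise(number = n()) %>%
    arrange(desc(number))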

Thank you,

Juan

JuanMayorga asked Oct 30 '22 04:10



1 Answer

As far as I know, dplyr and bigrquery do not currently support table wildcard functions. If you don't mind ugly hacks, however, you can extract and edit the query that dplyr builds and sends to BigQuery so that it points to several tables instead of just one.
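The substitution at the heart of this hack can be tried without a BigQuery connection. Here is a minimal sketch in base R, assuming a made-up SQL string (the exact text dplyr generates may differ) and the hypothetical names `sql_single`, `table_list` and `sql_multi`:

```r
# Hypothetical stand-in for the SQL that dplyr would generate for one table;
# the real generated text may look different.
sql_single <- "SELECT [year], [mo], [da], [temp] FROM [gsod1929] WHERE [stn] = '030750'"

# Fully qualified legacy SQL table references, one per year.
table_list <- paste0("[bigquery-public-data:noaa_gsod.gsod", c(1929, 1930), "]",
                     collapse = ", ")

# Swap the single table reference for the comma-separated list.
# fixed = TRUE treats the pattern as a literal string, so the square
# brackets need no escaping.
sql_multi <- gsub("[gsod1929]", table_list, sql_single, fixed = TRUE)
print(sql_multi)
```

In BigQuery legacy SQL, a comma-separated list of tables in the FROM clause is read as a union of those tables, which is why the edited query returns rows from every table in the list.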

Set your billing information and connect to BigQuery:

library(dplyr)
library(bigrquery)

my_billing <- "##########"  # placeholder: your billing project ID
bq_db <- src_bigquery(
  project = "bigquery-public-data",
  dataset = "noaa_gsod",
  billing = my_billing
)
gsod <- tbl(bq_db, "gsod1929")

How to select from one table (just for comparison):

gsod %>%
  filter(stn == "030750") %>%
  select(year, mo, da, temp) %>%
  collect()
Source: local data frame [92 x 4]

    year    mo    da  temp
   (chr) (chr) (chr) (dbl)
1   1929    10    01  45.2
2   1929    10    02  49.2
3   1929    10    03  48.2
4   1929    10    04  43.5
5   1929    10    05  42.0
6   1929    10    06  51.0
7   1929    10    07  48.0
8   1929    10    08  43.7
9   1929    10    09  45.1
10  1929    10    10  51.3
..   ...   ...   ...   ...

How to select from multiple tables by manually editing the query generated by dplyr:

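# Build the query for a single table without running it.
# build_query() is an unexported dplyr function, hence the `:::`.
multi_query <- gsod %>%
  filter(stn == "030750") %>%
  select(year, mo, da, temp) %>%
  dplyr:::build_query(.)

# Comma-separated, fully qualified table references; legacy SQL treats
# a comma-separated list in FROM as a union of the tables.
multi_tables <- paste("[bigquery-public-data:noaa_gsod.gsod", c(1929, 1930), "]",
                      sep = "", collapse = ", ")

# Replace the single table reference in the generated SQL and run it.
query_exec(
  query = gsub("\\[gsod1929\\]", multi_tables, multi_query$sql),
  project = my_billing
) %>% tbl_df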
Source: local data frame [449 x 4]

    year    mo    da  temp
   (chr) (chr) (chr) (dbl)
1   1930    06    11  51.8
2   1930    05    20  46.8
3   1930    05    21  48.5
4   1930    07    04  56.0
5   1930    08    08  54.5
6   1930    06    06  52.0
7   1930    01    14  36.8
8   1930    01    27  32.9
9   1930    02    08  35.6
10  1930    02    11  38.5
..   ...   ...   ...   ...

Validation of the results:

table(.Last.value$year)
1929 1930 
  92  357 
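
For the layout in the question (one table per day, named like `20150101`), the table list could be generated from a date range instead of typed by hand. Here is a rough sketch in base R. The helper name `daily_table_list` is made up for illustration, and `my_project` and `dataset1` are the placeholders from the question:

```r
# Build a comma-separated list of legacy SQL table references for daily
# tables named YYYYMMDD. `daily_table_list` is a hypothetical helper,
# not part of dplyr or bigrquery.
daily_table_list <- function(project, dataset, from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  paste0("[", project, ":", dataset, ".", format(days, "%Y%m%d"), "]",
         collapse = ", ")
}

# Three days, to keep the printed string short.
print(daily_table_list("my_project", "dataset1", "2015-01-01", "2015-01-03"))

# A full month of daily tables.
january <- daily_table_list("my_project", "dataset1", "2015-01-01", "2015-01-31")
```

The resulting string can be used in place of `multi_tables` as the replacement in the `gsub()` call above, with the pattern changed to match the single table name used when building the query.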
Backlin answered Nov 15 '22 05:11
