I have a simple DAG:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

with DAG(dag_id='my_dags.my_dag') as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='my_dataset.my_table20180524',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                query_params={})

    start >> bq_query >> end
When executing the bq_query task, the query result gets saved in a sharded table. I want it saved in a daily partitioned table instead. To do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing bq_query:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator, but I didn't find any example of how to use that parameter.
EDIT:
I'm using the google-cloud==0.27.0 Python client ... and it's the one used in prod :(
With the Airflow BigQuery operators you can also validate your data and check whether the executed SQL query returned valid results. For that you can use the BigQueryCheckOperator.
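A minimal sketch of what that could look like, assuming the contrib import path and a validation query chosen for illustration; the operator fails the task if the first row of the query result contains any falsy value:

from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

check_data = BigQueryCheckOperator(
    task_id='check_data',
    # hypothetical validation query against the table produced above
    sql='SELECT COUNT(*) FROM my_dataset.my_table20180524',
    bigquery_conn_id='my_bq_connection',
    dag=dag)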
With BigQueryOperator you can also pass the time_partitioning parameter, which creates ingestion-time partitioned tables.
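A minimal sketch of that approach, assuming Airflow 1.10+ (where time_partitioning is available and sql replaces the deprecated bql); the SQL, connection id and table name are carried over from the question:

bq_query = BigQueryOperator(
    task_id='bq_query',
    sql=sql,  # bql=sql on older Airflow versions
    # the $ suffix targets the daily partition for the execution date
    destination_dataset_table='my_dataset.my_table${{ ds_nodash }}',
    # tells BigQuery to create a day-partitioned table if it does not exist
    time_partitioning={'type': 'DAY'},
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    bigquery_conn_id='my_bq_connection',
    dag=dag)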
From the BigQueryOperator documentation (the operator executes BigQuery SQL queries in a specific BigQuery database), the parameters relevant here are:
bql – (Deprecated; use sql on newer versions.) Can receive a str representing a SQL statement, a list of str (SQL statements), or a reference to a template file; template references are recognized by a str ending in '.sql'.
schema_update_options (tuple) – Allows the schema of the destination table to be updated as a side effect of the load job.
query_params (list) – A list of dictionaries containing query parameter types and values, passed to BigQuery.
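Note that query_params parameterizes the query text itself (via @name placeholders in standard SQL) rather than the destination table, so it does not control partitioning. A minimal sketch, assuming standard SQL and a hypothetical cutoff parameter:

sql = """
    SELECT *
    FROM `another_dataset.another_table`
    WHERE created_at >= @cutoff
"""

bq_query = BigQueryOperator(
    task_id='bq_query_parameterized',
    sql=sql,
    use_legacy_sql=False,
    # BigQuery queryParameters structure: name, type and value of each parameter
    query_params=[{
        'name': 'cutoff',
        'parameterType': {'type': 'TIMESTAMP'},
        'parameterValue': {'value': '2018-05-24 00:00:00'},
    }],
    bigquery_conn_id='my_bq_connection',
    dag=dag)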
You first need to create an empty partitioned destination table. Follow the instructions here: link to create an empty partitioned table.
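To create that empty day-partitioned table programmatically, here is a minimal sketch using the google-cloud-bigquery Python client; note it relies on the TimePartitioning API from a newer client release (the google-cloud==0.27.0 mentioned in the question exposes a different interface), and the project id is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
# 'my_project' is a placeholder; the schema can stay empty because the
# query job with CREATE_IF_NEEDED / WRITE_TRUNCATE will populate it.
table = bigquery.Table('my_project.my_dataset.my_table')
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY)
client.create_table(table)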
Then run the Airflow pipeline below again. You can try this code:
import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date

with DAG(dag_id='my_dags.my_dag') as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='{{ params.t_name }}',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                params={'t_name': table_name},
                                dag=dag
                                )

    start >> bq_query >> end
So what I did is create a dynamic table-name variable and pass it to the BQ operator through the templated destination_dataset_table field.