Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating/Writing to Parititoned BigQuery table via Google Cloud Dataflow

I wanted to take advantage of the new BigQuery functionality of time partitioned tables, but am unsure this is currently possible in the 1.6 version of the Dataflow SDK.

Looking at the BigQuery JSON API, to create a day partitioned table one needs to pass in a

"timePartitioning": { "type": "DAY" }

option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.

I thought that maybe I could pre-create the table, and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda..? Is anyone else having success with creating/writing partitioned tables via Dataflow?

This seems like a similar issue to setting the table expiration time which isn't currently available either.

like image 883
ptf Avatar asked Jun 30 '16 05:06

ptf


People also ask

Can you partition in BigQuery?

You can partition BigQuery tables by: Time-unit column: Tables are partitioned based on a TIMESTAMP , DATE , or DATETIME column in the table. Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data. Integer range: Tables are partitioned based on an integer column.

How would you query specific partitions in a BigQuery table?

Query Specific Partitions when you create a table partitioned by according to a TIMESTAMP or DATE column. Tables partitioned according to a TIMESTAMP or DATE column do not have pseudo-columns! To limit the number of partitions analyzed when querying partitioned tables, you can use a predicate filter (WHERE clause).

What is the difference between partitioning and clustering in BigQuery?

Like clustering, partitioning uses user-defined partition columns to specify how data is partitioned and what data is stored in each partition. Unlike clustering, partitioning provides granular query cost estimates before you run a query.


1 Answers

As Pavan says, it is definitely possible to write to partition tables with Dataflow. Are you using the DataflowPipelineRunner operating in streaming mode or batch mode?

The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:

/**
 * A Joda-time formatter that prints a date in format like {@code "20160101"}.
 * Threadsafe.
 */
private static final DateTimeFormatter FORMATTER =
    DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);

// This code generates a valid BigQuery partition name:
Instant instant = Instant.now(); // any Joda instant in a reasonable time range
String baseTableName = "project:dataset.table"; // a valid BigQuery table name
String partitionName =
    String.format("%s$%s", baseTableName, FORMATTER.print(instant));
like image 51
Dan Halperin Avatar answered Sep 28 '22 06:09

Dan Halperin