Extract BigQuery partitioned table

Is there a way to extract a complete BigQuery partitioned table with one command, so that the data of each partition is extracted into a separate folder named in the format part_col=date_yyyy-mm-dd?

Since a BigQuery partitioned table can read files from hive-style partitioned directories, is there a way to extract the data in a similar layout? I can extract each partition separately, but that is very cumbersome when I am extracting a lot of partitions.

Trishit Ghosh asked Jul 02 '19 14:07





1 Answer

You could do this programmatically. For instance, you can export a single partition by using a partition decorator such as table$20190801. Then, in the bq extract command, you can use URI patterns (see the example of the workers pattern) for the GCS objects.

Since all objects will be in the same bucket, the folders are just a hierarchical illusion, so you can specify URI patterns on the folders as well, but not on the bucket.

So you would write a script that loops over the DATE values, with something like:

bq extract \
--destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO] \
--compression [GZIP, AVRO supports DEFLATE and SNAPPY] \
--field_delimiter [DELIMITER] \
--print_header [true, false] \
'[PROJECT_ID]:[DATASET].[TABLE]$[DATE]' \
gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]
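
The loop itself could look roughly like the sketch below. It is only an illustration under some assumptions that are not part of the answer above: the project, dataset, table and bucket names are placeholders, the helper functions (partition_uri, partition_table, extract_range) are hypothetical names, GNU date is available for the date arithmetic, and the output is gzip-compressed CSV. By default it only prints the bq commands (DRY_RUN=1) so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Sketch: export each daily partition into its own hive-style folder.
# Assumptions: placeholder names below, GNU date, CSV + GZIP output.
set -euo pipefail

PROJECT="${PROJECT:-my-project}"   # placeholder project id
DATASET="${DATASET:-my_dataset}"   # placeholder dataset
TABLE="${TABLE:-my_table}"         # placeholder table
BUCKET="${BUCKET:-my-bucket}"      # placeholder bucket
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands, 0 = run bq

# Destination URI for one partition date (yyyy-mm-dd), with a wildcard
# so a large partition can be split across several files.
partition_uri() {
  printf 'gs://%s/part_col=%s/data-*.csv.gz\n' "$BUCKET" "$1"
}

# Source table with a partition decorator (yyyymmdd, dashes removed).
partition_table() {
  printf '%s:%s.%s$%s\n' "$PROJECT" "$DATASET" "$TABLE" "${1//-/}"
}

# Extract every partition from start date to end date, inclusive.
extract_range() {
  local day="$1" end="$2"
  # ISO dates compare correctly as plain strings.
  while [[ ! "$day" > "$end" ]]; do
    local cmd=(bq extract --destination_format CSV --compression GZIP
               "$(partition_table "$day")" "$(partition_uri "$day")")
    if [[ "$DRY_RUN" == "1" ]]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
    day="$(date -u -d "$day +1 day" +%F)"   # GNU date syntax
  done
}
```

For example, extract_range 2019-08-01 2019-08-31 would print one bq extract command per day of August 2019; setting DRY_RUN=0 runs them instead. Partitions with no data would still need handling of your choice, which this sketch does not attempt.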

You can't do it automatically with a single bq command. For that, it would be better to raise a feature request, as suggested by Felipe.

Héctor Neri answered Sep 26 '22 13:09
