I am working on exporting a large dataset from BigQuery to Google Cloud Storage in compressed format. Google Cloud Storage has a file size limitation for me (maximum 1 GB per file), so I am using the split and compression options while exporting. The sample code is as follows:
import logging

from google.cloud import bigquery
from google.cloud import storage

# project, dataset_id, table_id and bucket_name are defined elsewhere in my script
bigquery_client = bigquery.Client(project=project)
storage_client = storage.Client(project=project)

gcs_destination_uri = 'gs://{}/{}'.format(bucket_name, 'wikipedia-*.csv.gz')
gcs_bucket = storage_client.get_bucket(bucket_name)

# Job config: gzip-compress each exported shard
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP


def bigquery_datalake_load():
    dataset_ref = bigquery_client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # API request
    row_count = table.num_rows

    extract_job = bigquery_client.extract_table(
        table_ref,
        gcs_destination_uri,
        location='US',
        job_config=job_config)  # API request
    logging.info('BigQuery extract started... Wait for the job to complete.')
    extract_job.result()  # Waits for the job to complete.
    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, gcs_destination_uri))
# [END bigquery_extract_table]
This code splits the large dataset and compresses it into .gz format, but it returns many compressed files whose sizes range between 40 MB and 70 MB.
I am trying to generate compressed files of about 1 GB each. Is there any way to achieve this?
If your data has more than 16,000 rows, you'd need to save the result of your query as a BigQuery table. Afterwards, export the data from that table into Google Cloud Storage using any of the available options (such as the Cloud Console, the API, the bq command-line tool, or the client libraries).
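As a rough sketch of that approach with the Python client library (reusing bigquery_client, project, dataset_id and bucket_name from the question; the destination table name query_result_table and the sample query are illustrative, not part of the original code):

from google.cloud import bigquery

# Materialize the query result into a destination table first.
destination = bigquery.TableReference.from_string(
    '{}.{}.query_result_table'.format(project, dataset_id))
query_job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
query_job = bigquery_client.query(
    'SELECT * FROM `bigquery-public-data.samples.wikipedia`',
    job_config=query_job_config)
query_job.result()  # Wait until the destination table has been written.

# Then export the materialized table to Cloud Storage, gzip-compressed and sharded.
extract_job_config = bigquery.job.ExtractJobConfig()
extract_job_config.compression = bigquery.Compression.GZIP
extract_job = bigquery_client.extract_table(
    destination,
    'gs://{}/wikipedia-*.csv.gz'.format(bucket_name),
    location='US',
    job_config=extract_job_config)
extract_job.result()  # Wait for the export to complete.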
BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset.
BigQuery supports querying Cloud Storage data in the following formats: comma-separated values (CSV), newline-delimited JSON, and Avro.
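For completeness, here is a hedged sketch of querying the gzip-compressed CSV shards in Cloud Storage directly as an external table definition (the table alias wikipedia_ext is made up; bigquery_client and bucket_name are reused from the question):

from google.cloud import bigquery

external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://{}/wikipedia-*.csv.gz'.format(bucket_name)]
external_config.compression = 'GZIP'   # The shards are gzip-compressed CSV.
external_config.autodetect = True      # Let BigQuery infer the schema.

job_config = bigquery.QueryJobConfig(
    table_definitions={'wikipedia_ext': external_config})
query_job = bigquery_client.query(
    'SELECT COUNT(*) AS row_count FROM wikipedia_ext',
    job_config=job_config)
for row in query_job.result():
    print(row.row_count)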
Unfortunately no. Google adjusts the file size by itself; there is no option to specify it. I believe this is because of the size of the uncompressed data (each BigQuery worker produces one file, and it is impossible to produce one file from multiple workers).
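To illustrate, a short snippet (reusing gcs_bucket from the question) that lists the shards BigQuery actually produced together with their sizes, which the service chooses on its own:

# Inspect the exported shards; their sizes are picked by BigQuery, not by us.
for blob in gcs_bucket.list_blobs(prefix='wikipedia-'):
    print('{}: {:.1f} MB'.format(blob.name, blob.size / (1024 * 1024)))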