 

Controlling file size while exporting data from BigQuery to Google Cloud Storage


I am exporting a large dataset from BigQuery to Google Cloud Storage in compressed format. In Google Cloud Storage I have a file size limitation (each file must be at most 1 GB). Therefore I am using split and compression techniques to split the data while exporting. Sample code follows:

import logging

from google.cloud import bigquery
from google.cloud import storage

# Placeholder values -- replace with your own project, dataset, table and bucket.
project = 'my-project'
dataset_id = 'my_dataset'
table_id = 'wikipedia'
bucket_name = 'my-bucket'

bigquery_client = bigquery.Client(project=project)
storage_client = storage.Client(project=project)

# Wildcard URI: BigQuery shards the export across multiple files.
gcs_destination_uri = 'gs://{}/{}'.format(bucket_name, 'wikipedia-*.csv.gz')
gcs_bucket = storage_client.get_bucket(bucket_name)

# Job config: gzip-compress each exported shard.
job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

def bigquery_datalake_load():
    dataset_ref = bigquery_client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)
    table = bigquery_client.get_table(table_ref)  # API request
    row_count = table.num_rows

    extract_job = bigquery_client.extract_table(
        table_ref,
        gcs_destination_uri,
        location='US',
        job_config=job_config)  # API request
    logging.info('BigQuery extract started. Waiting for the job to complete.')
    extract_job.result()  # Waits for the job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, gcs_destination_uri))

This code splits the large dataset and compresses it into .gz format, but it returns multiple compressed files whose sizes range between 40 MB and 70 MB.

I am trying to generate compressed files of 1 GB each. Is there any way to get this done?

asked Jun 20 '18 by Sandeep Singh


People also ask

How can I export more than 16000 rows in BigQuery?

If your data has more than 16,000 rows, you'd need to save the result of your query as a BigQuery table. Afterwards, export the data from the table into Google Cloud Storage using any of the available options (such as the Cloud Console, the API, bq, or the client libraries).
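For instance, a minimal sketch with the Python client (the project, dataset, table, and bucket names here are hypothetical placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Save the query result as a table first (all names are placeholders).
destination = bigquery.TableReference.from_string(
    'my-project.my_dataset.query_result')
job_config = bigquery.QueryJobConfig(destination=destination)
client.query(
    'SELECT * FROM `my-project.my_dataset.wikipedia`',
    job_config=job_config).result()

# Then export the saved table to Cloud Storage.
client.extract_table(
    destination, 'gs://my-bucket/query-result-*.csv').result()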

In which format does BigQuery save data?

BigQuery stores table data in columnar format, meaning it stores each column separately. Column-oriented databases are particularly efficient at scanning individual columns over an entire dataset.
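One way to see this in practice is a dry-run query against a public sample table: selecting a single column is billed for far fewer bytes than SELECT * (a sketch, assuming the google-cloud-bigquery client):

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: estimates bytes scanned without running the query or incurring cost.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# One column is read from that column's storage only...
job = client.query(
    'SELECT title FROM `bigquery-public-data.samples.wikipedia`',
    job_config=config)
print('One column: {} bytes'.format(job.total_bytes_processed))

# ...while SELECT * reads every column of the table.
job = client.query(
    'SELECT * FROM `bigquery-public-data.samples.wikipedia`',
    job_config=config)
print('All columns: {} bytes'.format(job.total_bytes_processed))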

Does BigQuery use cloud storage?

BigQuery supports querying Cloud Storage data in the following formats: comma-separated values (CSV), newline-delimited JSON, and Avro.
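As an illustration, CSV files in a bucket can be queried directly through an external table definition (a sketch; the bucket name and table alias are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# External table definition over CSV files in Cloud Storage (placeholder URI).
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://my-bucket/wikipedia-*.csv']
external_config.autodetect = True

# Expose the external data under a temporary alias for this query only.
job_config = bigquery.QueryJobConfig(
    table_definitions={'wiki_ext': external_config})
for row in client.query('SELECT COUNT(*) AS n FROM wiki_ext',
                        job_config=job_config):
    print(row.n)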


1 Answer

Unfortunately, no. Google adjusts it by itself; you do not have an option to specify the size. I believe this is because of the size of the uncompressed data: each BigQuery worker produces one file, and it is impossible to produce one file from multiple workers.
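To verify what the export actually produced, you can list the shards and their sizes with the storage client (a sketch; the bucket name and prefix are placeholders):

from google.cloud import storage

client = storage.Client()

# List the exported shards and report each one's size in MB.
for blob in client.list_blobs('my-bucket', prefix='wikipedia-'):
    print('{}: {:.1f} MB'.format(blob.name, blob.size / 1024 / 1024))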

answered Oct 07 '22 by Alexey Maloletkin