I'm trying to upload a pandas.DataFrame
to Google Big Query using the pandas.DataFrame.to_gbq()
function documented here. The problem is that to_gbq()
takes 2.3 minutes, while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32), each of a similar size, so I want to know which alternative is faster.
This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False)

dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB, this takes almost 2 seconds
# manually upload the file into the GCS GUI

print(dataframe.shape)
(363364, 21)
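For reference, the manual GCS upload step can also be scripted rather than done through the GUI. Below is a minimal sketch using the google-cloud-storage client; the project id, bucket name, object path and local file name are placeholders, not names from the actual setup:

from google.cloud import storage

# Upload the CSV produced above to Google Cloud Storage instead of using the GUI.
# 'my_project_id', 'my_bucket' and the object path are placeholder names.
storage_client = storage.Client(project='my_project_id')
bucket = storage_client.bucket('my_bucket')
blob = bucket.blob('uploads/month_file.csv')
blob.upload_from_filename('month_file.csv')  # the CSV written by dataframe.to_csv() above
print('Uploaded to gs://my_bucket/' + blob.name)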
My question is, which is faster?

1. Upload the Dataframe using the pandas.DataFrame.to_gbq() function
2. Save the Dataframe as a CSV, then upload it as a file to BigQuery using the Python API
3. Save the Dataframe as a CSV, then upload the file to Google Cloud Storage using this procedure and then read it from BigQuery

Update:
Alternative 1 seems faster than alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(); alternative 2 took about 17.9 seconds longer on average over 3 runs):
from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'text/csv'
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
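A minimal usage sketch for timing alternative 2 end to end; the dataset, table and file names are placeholders, and dataframe is the DataFrame from the question:

import time

# Time alternative 2: write the DataFrame to a local CSV,
# then load that file into BigQuery with the function above.
# 'my_dataset', 'my_table' and 'month_file.csv' are placeholder names.
start = time.time()
dataframe.to_csv('month_file.csv', index=False)
load_data_from_file('my_dataset', 'my_table', 'month_file.csv')
print('alternative 2 took {:.1f} s'.format(time.time() - start))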
Due to the separation between compute and storage layers, BigQuery requires an ultra-fast network which can deliver terabytes of data in seconds directly from storage into compute for running Dremel jobs. Google's Jupiter network enables BigQuery service to utilize 1 Petabit/sec of total bisection bandwidth.
I did the comparison for alternatives 1 and 3 in Datalab
using the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write (rows as lists rather than sets, so column order is deterministic)
my_data = [[1, 2, 3]]
for i in range(0, 100000):
    my_data.append([1, 2, 3])
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False)
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name).create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
and here are the results for n = {10000,100000,1000000}:
n          alternative_1   alternative_3
10000      30.72 s         8.14 s
100000     162.43 s        70.64 s
1000000    1473.57 s       688.59 s
Judging from the results, alternative 3 is faster than alternative 1.
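Outside Datalab, the last step of alternative 3 (reading the CSV that already sits in GCS into BigQuery) can be done as a load job from a gs:// URI. Here is a rough sketch with the google-cloud-bigquery client; the bucket, object, dataset and table names are placeholders, and the exact source_format value depends on your client-library version:

from google.cloud import bigquery

# Load a CSV already stored in GCS into a BigQuery table.
# 'my-bucket', 'TestDataSet' and 'TestTable' are placeholder names.
bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset('TestDataSet').table('TestTable')

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'CSV'  # older client versions expected 'text/csv'
job_config.autodetect = True

job = bigquery_client.load_table_from_uri(
    'gs://my-bucket/not_so_simple_dataframe.csv', table_ref,
    job_config=job_config)
job.result()  # waits for the load job to finish
print('Loaded {} rows.'.format(job.output_rows))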