
Efficiently write a Pandas dataframe to Google BigQuery

I'm trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading the file directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of DataFrames (~32), each of a similar size, so I want to know which alternative is faster.

This is the script that I'm using:

dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried several chunk sizes; it runs faster as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv')  # the file size is 37.3 MB, this takes almost 2 seconds
# manually upload the file into the GCS GUI

print(dataframe.shape)
(363364, 21)

My question is: which of the following is faster?

  1. Uploading the DataFrame using the pandas.DataFrame.to_gbq() function
  2. Saving the DataFrame as a CSV and then uploading it as a file to BigQuery using the Python API
  3. Saving the DataFrame as a CSV, uploading the file to Google Cloud Storage using this procedure, and then loading it from there into BigQuery (see the sketch after this list)
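
For reference, alternative 3 can be scripted end to end from Python with the google-cloud-storage and google-cloud-bigquery clients instead of uploading through the GCS GUI. This is only a minimal sketch, assuming hypothetical bucket, dataset, and table names:

from google.cloud import bigquery
from google.cloud import storage

# Hypothetical names -- replace with your own bucket, dataset and table
bucket_name = 'my_staging_bucket'
dataset_id = 'my_dataset'
table_id = 'my_table'
file_name = 'month_file.csv'  # the CSV written with dataframe.to_csv()

# Upload the CSV file to Google Cloud Storage
storage_client = storage.Client()
storage_client.bucket(bucket_name).blob(file_name).upload_from_filename(file_name)

# Load the GCS file into BigQuery with a load job
bigquery_client = bigquery.Client()
table_ref = bigquery_client.dataset(dataset_id).table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

job = bigquery_client.load_table_from_uri(
    'gs://{}/{}'.format(bucket_name, file_name), table_ref,
    job_config=job_config)
job.result()  # waits for the load job to finish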

Update:

Alternative 1 seems to be faster than Alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(), Alternative 2 took on average 17.9 seconds longer across 3 runs):

from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.CSV
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
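
For example, it could be called on the CSV written earlier (the dataset and table names here are placeholders):

load_data_from_file('my_dataset', 'my_table', str(month) + '_file.csv')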
Pablo asked Feb 20 '18




1 Answer

I did the comparison for alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
import time

# Dataframe to write
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False
                               )
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Define the BigQuery table
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))
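
Note that the %storage write line is a Datalab magic, so this snippet has to run inside a Datalab notebook cell; in a plain Python script the DataFrame would have to be written to GCS with the regular storage client instead.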

Here are the results for n = {10000, 100000, 1000000}:

n        alternative_1  alternative_3
10000    30.72s         8.14s
100000   162.43s        70.64s
1000000  1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.

enle lin answered Sep 28 '22