Takes too long to export data from BigQuery into a Jupyter notebook

In a Jupyter notebook, I am trying to import data from BigQuery by running a SQL query on the BigQuery server and storing the result in a DataFrame:

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"
from google.cloud import bigquery

sql = """
SELECT * FROM dataset.table
"""

# run the query and load the full result into a pandas DataFrame
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()

The data has the shape (6000000, 8) and uses about 350 MB of memory once stored in the DataFrame.

The query itself, if executed directly in the BigQuery console, takes about 2 seconds.

However, the code above usually takes about 30-40 minutes to run, and more often than not it fails with the following error:

ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))

All in all, there could be three reasons for the error (a rough timing sketch to separate them follows this list):

  1. It takes the BigQuery server a long time to execute the query.
  2. It takes a long time to transfer the data (I don't understand why ~350 MB should take 30 minutes to send over the network; I tried a wired LAN connection to rule out connection drops and maximize throughput, which didn't help).
  3. It takes a long time to build the DataFrame from the BigQuery data.
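A minimal way to separate these causes is to time the query job and the download independently (a rough sketch, reusing the client and sql defined above):

import time

t0 = time.time()
job = client.query(sql)      # submit the query job
job.result()                 # block until the query itself has finished on the server
print(f"query: {time.time() - t0:.1f}s")

t1 = time.time()
df_bq = job.to_dataframe()   # download the rows and build the DataFrame
print(f"download + DataFrame: {time.time() - t1:.1f}s")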

Would be happy to gain any insight into the problem, thanks in advance!

Max Sfnv asked Nov 22 '18


People also ask

Which library needs to be installed to access BigQuery data in a notebook?

The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.
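For reference, using the magic in a notebook looks roughly like this: one cell loads the extension (it ships with the google-cloud-bigquery library), and a second cell starting with %%bigquery runs the query and saves the result to a pandas DataFrame (here named df; the name is arbitrary).

%load_ext google.cloud.bigquery

%%bigquery df
SELECT * FROM dataset.table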


2 Answers

Use the BigQuery Storage API to get large query results from BigQuery into a pandas DataFrame really fast.

Working code snippet:

import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage

# uncomment this part if you are working locally
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="your_json_key.json"

# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)

# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)

# Define your query.
your_query = """select * from your_big_query_table"""

# Pass your bqstorage_client as an argument to the to_dataframe() method.
# I've also added the tqdm progress bar here so you get better insight
# into how long it's still going to take.
dataframe = (
    bqclient.query(your_query)
            .result()
            .to_dataframe(
                bqstorage_client=bqstorageclient,
                progress_bar_type='tqdm_notebook',
            )
)

You can find more on how to use the BigQuery Storage API here:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
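Since the question's query is a plain SELECT *, you can also skip the query job entirely and read the table directly (a sketch, reusing the clients above; "your-project.dataset.your_table" is a placeholder table ID):

# Read a whole table without running a query job; the Storage API
# streams the rows straight into a DataFrame.
table_id = "your-project.dataset.your_table"
rows = bqclient.list_rows(table_id)
dataframe = rows.to_dataframe(bqstorage_client=bqstorageclient)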

Sander van den Oord answered Oct 01 '22


Try using the BigQuery Storage API; it's blazing fast for downloading large tables as pandas DataFrames. A minimal change to the code from the question is sketched after the link below.

https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
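The sketch below assumes the google-cloud-bigquery-storage and pyarrow packages are installed, and that create_bqstorage_client is available in your version of google-cloud-bigquery (recent releases):

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT * FROM dataset.table"

# create_bqstorage_client=True lets the client create a BigQuery Storage
# client behind the scenes and use it for the download step.
df_bq = client.query(sql).result().to_dataframe(create_bqstorage_client=True)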

user983302 answered Oct 01 '22