In a Jupyter Notebook, I am trying to import data from BigQuery by running a SQL query on the BigQuery server and storing the result in a pandas DataFrame:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
from google.cloud import bigquery
sql = """
SELECT * FROM dataset.table
"""
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()
The data has the shape (6000000, 8) and uses about 350MB of memory once stored in the dataframe.
The query sql, if executed directly in BQ, takes about 2 seconds.
However, the code above usually takes about 30-40 minutes to run, and more often than not it fails with the following error:
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
All in all, there could be three reasons for the error:
Would be happy to gain any insight into the problem, thanks in advance!
The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.
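For reference, a minimal sketch of the magic-command route (an assumption about your environment; the %load_ext step is only needed where the extension isn't pre-loaded, and df_bq is just an example variable name):
%load_ext google.cloud.bigquery
Then, in its own notebook cell, run the query and save the result to a DataFrame named df_bq:
%%bigquery df_bq
SELECT * FROM dataset.table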
Use the BigQuery Storage API to get large query results from BigQuery into a pandas DataFrame really fast.
Working code snippet:
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage
# uncomment this part if you are working locally
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="your_json_key.json"
# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)
# Define your query.
your_query = """select * from your_big_query_table"""
# Set your bqstorage_client as an argument in the to_dataframe() method.
# I've also added the tqdm progress bar here so you get better insight
# into how long it's still going to take.
dataframe = (
    bqclient.query(your_query)
    .result()
    .to_dataframe(
        bqstorage_client=bqstorageclient,
        progress_bar_type="tqdm_notebook",
    )
)
You can find more on how to use the BigQuery Storage API with pandas here:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
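Note that the snippet above relies on the google-cloud-bigquery-storage package (plus pyarrow and tqdm for the fast download path and the progress bar); a minimal install cell, assuming you're working inside a notebook, could look like:
# Install the client libraries used by the snippet above (notebook magic).
%pip install google-cloud-bigquery google-cloud-bigquery-storage pyarrow tqdm pandas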
Try using the BigQuery Storage API - it's blazing fast for downloading large query results as pandas DataFrames:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
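Alternatively, newer versions of google-cloud-bigquery can create the storage client for you, so you don't have to construct it by hand; a minimal sketch, assuming the google-cloud-bigquery-storage package is installed and reusing the query from the question:
from google.cloud import bigquery

client = bigquery.Client()

# Let the client library create a BigQuery Storage client internally,
# which enables the faster, Arrow-based download path.
df_bq = (
    client.query("SELECT * FROM dataset.table")
    .result()
    .to_dataframe(create_bqstorage_client=True)
)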