In a Jupyter Notebook, I am trying to import data from BigQuery by running a SQL query on the BigQuery server and storing the result in a pandas DataFrame:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="credentials.json"
from google.cloud import bigquery
sql = """
SELECT * FROM dataset.table
"""
client = bigquery.Client()
df_bq = client.query(sql).to_dataframe()
The data has the shape (6000000, 8) and uses about 350MB of memory once stored in the dataframe.
The query sql, if executed directly in BQ, takes about 2 seconds.
However, the code above usually takes about 30-40 minutes to run, and more often than not it fails with the following error:
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
All in all, there could be three reasons for the error:
Would be happy to gain any insight into the problem, thanks in advance!
The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.
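For reference, a minimal sketch of the magic-command route (an assumption about your environment; the %load_ext step is only needed where the extension isn't pre-loaded, and df_bq is just an example variable name):
%load_ext google.cloud.bigquery
Then, in its own notebook cell, run the query and save the result to a DataFrame named df_bq:
%%bigquery df_bq
SELECT * FROM dataset.table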
Use the BigQuery Storage API to get large query results from BigQuery into a pandas DataFrame really fast.
Working code snippet:
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage
# uncomment this part if you are working locally
# import os
# os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="your_json_key.json"
# Explicitly create a credentials object. This allows you to use the same
# credentials for both the BigQuery and BigQuery Storage clients, avoiding
# unnecessary API calls to fetch duplicate authentication tokens.
credentials, your_project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
# Make clients.
bqclient = bigquery.Client(credentials=credentials, project=your_project_id)
bqstorageclient = bigquery_storage.BigQueryReadClient(credentials=credentials)
# Define your query.
your_query = """select * from your_big_query_table"""
# Set your bqstorage_client as an argument in the to_dataframe() method.
# I've also added the tqdm progress bar here so you get better insight
# into how long it's still going to take.
dataframe = (
    bqclient.query(your_query)
    .result()
    .to_dataframe(
        bqstorage_client=bqstorageclient,
        progress_bar_type="tqdm_notebook",
    )
)
You can find more on how to use the BigQuery Storage API with pandas here:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
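Note that the snippet above relies on the google-cloud-bigquery-storage package (plus pyarrow and tqdm for the fast download path and the progress bar); a minimal install cell, assuming you're working inside a notebook, could look like:
# Install the client libraries used by the snippet above (notebook magic).
%pip install google-cloud-bigquery google-cloud-bigquery-storage pyarrow tqdm pandas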
Try using the BigQuery Storage API - it's blazing fast for downloading large query results as pandas DataFrames:
https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas
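Alternatively, newer versions of google-cloud-bigquery can create the storage client for you, so you don't have to construct it by hand; a minimal sketch, assuming the google-cloud-bigquery-storage package is installed and reusing the query from the question:
from google.cloud import bigquery

client = bigquery.Client()

# Let the client library create a BigQuery Storage client internally,
# which enables the faster, Arrow-based download path.
df_bq = (
    client.query("SELECT * FROM dataset.table")
    .result()
    .to_dataframe(create_bqstorage_client=True)
)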