google-cloud-bigquery version: 2.8.0

I'm provisioning a Dataproc cluster that loads data from BigQuery into a pandas DataFrame. As my data grows, I've been looking to boost performance and heard about using the BigQuery Storage client.
I had the same problem in the past, and it was solved by pinning google-cloud-bigquery to version 1.26.1. If I use that version, I get the following message:
/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
"Cannot create BigQuery Storage client, the dependency "
The code snippet then executes, but at a much slower rate. If I do not pin the pip versions, I encounter the error shown below. This is the cluster creation command:
gcloud dataproc clusters create testing-cluster --region=europe-west1 --zone=europe-west1-b --master-machine-type n1-standard-16 --single-node --image-version 1.5-debian10 --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'
And the Python code:

from google.cloud import bigquery

bqclient = bigquery.Client(project=project)
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("query_start", "STRING", "2021-02-09 00:00:00"),
        bigquery.ScalarQueryParameter("query_end", "STRING", "2021-02-09 23:59:59.99"),
    ]
)
# create_bqstorage_client=True downloads the results via the BigQuery Storage API
df = bqclient.query(query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
2021-02-11 10:10:14,069 - preprocessing logger initialized
2021-02-11 10:10:14,069 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
Traceback (most recent call last):
File "/tmp/782503bcc80246258560a07d2179891f/immo_preprocessing-pageviews_kyero.py", line 104, in <module>
df = bqclient.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1333, in to_dataframe
date_as_object=date_as_object,
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
Using the pandas-gbq version gives exactly the same error:
import pandas as pd

query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'query_start',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 00:00:00'}
            },
            {
                'name': 'query_end',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 23:59:59.99'}
            },
        ]
    }
}
# use_bqstorage_api=True routes the download through the BigQuery Storage API
df = pd.read_gbq(base_query,
                 configuration=query_config,
                 progress_bar_type='tqdm',
                 use_bqstorage_api=True)
2021-02-11 09:21:19,532 - preprocessing logger initialized
2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
started
Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
Traceback (most recent call last):
File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
use_bqstorage_api=True)
File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
**kwargs,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
dtypes=dtypes,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
user_dtypes=dtypes,
File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
**to_dataframe_kwargs
File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
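For reference, a quick way to confirm which pyarrow version actually ended up on the cluster (a minimal check; run it on a cluster node, e.g. over SSH):

python -c "import pyarrow; print(pyarrow.__version__)"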
https://github.com/googleapis/python-bigquery/issues/519
@Sam answered this, but I thought I'd just mention the actionable commands:
In a Jupyter notebook:
!pip install pyarrow==3.0.0
In your virtualenv:
pip install pyarrow==3.0.0
Dataproc installs pyarrow 0.15.0 by default, while the BigQuery Storage API needs a more recent version. Manually pinning pyarrow to 3.0.0 at install time solved the issue. That said, PySpark has a compatibility setting for pyarrow >= 0.15.0, the ARROW_PRE_0_15_IPC_FORMAT environment variable (https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark), and the Dataproc release notes show it has been set by default since May 2020.
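In practice, that means pinning pyarrow in the cluster creation command itself. Here is a sketch of the amended command from the question, assuming the same init action and package list (only the pyarrow pin is new):

gcloud dataproc clusters create testing-cluster \
    --region=europe-west1 \
    --zone=europe-west1-b \
    --master-machine-type n1-standard-16 \
    --single-node \
    --image-version 1.5-debian10 \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
    --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq pyarrow==3.0.0'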