I'm working on a script where I'm sending a dataframe to BigQuery:
load_job = bq_client.load_table_from_dataframe(
df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)
# Wait for the load job to complete
return load_job.result()
This is working fine but only if a schema has already been defined in BigQuery or if I'm defining the schema of my job in my script. If no schema has been defined I have the following error:
Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe dataframe.to_parquet(tmppath, compression=parquet_compression) File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write **kwargs File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table writer.write_table(table, row_group_size=row_group_size) File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table self.writer.write_table(table, row_group_size=row_group_size) File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function _function_handler.invoke_user_function(event_object) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function return call_user_function(request_or_event) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function event_context.Context(**request_or_event.context)) File "/user_code/main.py", line 151, in main df = df(param1, param2) File "/user_code/main.py", line 141, in get_df df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe os.remove(tmppath) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'
Why is pyarrow
generating this error? How can I solve it other than pre-defining schema?
i was getting the same error
when i inspect the dataframe i saw a value like this: 2021-09-30 23:59:59.999999998
You probably have a mismatch in your date fields with the bigquery default. then i used this code:
df['date_column'] =df['date_column'].astype('datetime64[s]')
then the problem was solved for me.
The default behavior when converting to Arrow or Parquet from pandas is to not allow silent data loss. There are options to set when doing the conversion to allow permit unsafe casts causing loss of timestamp precision or other forms of data loss. The BigQuery Python API would need to set these options, so it may be a bug in the BigQuery library. I suggest reporting on their issue tracker https://github.com/googleapis/google-cloud-python
The solution for me was to add the following kwargs to to_parquet:
parquet_args = {
'coerce_timestamps': 'us',
'allow_truncated_timestamps': True,
}
You have to set both of them. If you set just allow_truncated_timestamps
, it will still raise the error if coerce_timestamps
is None
. I think the idea is that you only want to suppress the error if you're explicitly asking for coercion. Anyway, the docs are clear about it, but this behavior wasn't obvious to me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With