I am converting data from CSV to Parquet using Python (Pandas) to later load it into Google BigQuery. Some integer columns contain missing values, and since Pandas 0.24.0 I can store them with the nullable Int64 dtype.
Is there a way to use the Int64 dtype in a Parquet file as well? I can't find a clean solution for integers with missing values (so that they stay INTEGER in BigQuery).
I have tried importing the CSV directly into BigQuery and got the same error as when converting to Parquet with Pandas (shown below).
Import a CSV with an int column that includes a missing value:
import pandas as pd
df = pd.read_csv("docs/test_file.csv")
print(df["id"].info())
id 8 non-null float64
The column is imported as float64. I change the type to Int64:
df["id"] = df["id"].astype('Int64')
print(df["id"].info())
id 8 non-null Int64
Then I try to save it to Parquet:
df.to_parquet("output/test.parquet")
The error:
pyarrow.lib.ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column id with type Int64')
There's currently an open issue on google-cloud-bigquery to support the new Int64 dtype: https://github.com/googleapis/google-cloud-python/issues/7702.
In the meantime, I suggest using the object dtype. As of version 1.13.0 of google-cloud-bigquery, you can specify the desired BigQuery schema, and the library will use the desired types in the Parquet file it writes.
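A minimal sketch of that cast, applied to the asker's id column (that missing entries survive the astype(object) round trip as nulls is an assumption based on pandas 0.24 behavior):

import pandas as pd

df = pd.read_csv("docs/test_file.csv")
# Round-trip through Int64 so the values become integers, then cast to
# object; missing entries remain nulls instead of forcing float64.
df["id"] = df["id"].astype("Int64").astype(object)

The snippet below, adapted from the library's integration tests, shows how an explicit schema makes the loader keep INTEGER even for an all-null column: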
# Explicit schema so the all-null column is loaded as INTEGER.
import pandas
from google.cloud import bigquery

client = bigquery.Client()
dataset_id = "my_dataset"  # placeholder: your dataset ID

table_schema = [bigquery.SchemaField("int_col", "INTEGER")]

num_rows = 100
nulls = [None] * num_rows
dataframe = pandas.DataFrame({"int_col": nulls})

table_id = "{}.{}.load_table_from_dataframe_w_nulls".format(
    client.project, dataset_id
)

job_config = bigquery.LoadJobConfig(schema=table_schema)
load_job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)
load_job.result()  # wait for the load job to complete
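Continuing from the object-dtype cast above, the asker's dataframe can be loaded the same way (the project, dataset, and table names here are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("id", "INTEGER")]
)
load_job = client.load_table_from_dataframe(
    df, "my-project.my_dataset.test_table", job_config=job_config
)
load_job.result()

With the explicit schema, the id column should arrive in BigQuery as a nullable INTEGER rather than FLOAT.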