I am trying to use pandas and PyArrow to write Parquet data. I have hundreds of Parquet files that don't need to share the same schema, but when a column appears in more than one file it must have the same data type in all of them.
I'm getting into situations where the resulting Parquet data types are not what I want them to be. For example, I may write an int64 to a column and the resulting Parquet file stores it as a double. This causes a lot of trouble on the processing side, where 99% of the data is typed correctly but the remaining 1% is just the wrong type.
I've tried importing numpy and wrapping the values like this:

import numpy as np
import pandas

pandas.DataFrame({
    'a': [np.int64(5100), np.int64(5200), np.int64(5300)]
})
But I'm still getting the occasional double, so this must be the wrong way to do it. How can I ensure data types stay consistent for matching columns across Parquet files?
Update:
I found this only happens when the column contains one or more Nones.

data_frame = pandas.DataFrame({
    'a': [None, np.int64(5200), np.int64(5200)]
})

Can Parquet not handle mixed None/int64 columns?
Parquet files use a small number of primitive (physical) data types; logical types extend the physical types by specifying how they should be interpreted. Data types not covered by these (JSON, BSON, raw binary, and so on) are not supported for reading from or writing to Parquet files.
pandas uses different names for data types than Python does, for example object for textual data. A column in a DataFrame can only have one data type, and you can inspect it via the column's dtype attribute. Make conscious decisions about how to manage missing data.
The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta[ns], and object. In addition, these dtypes come in different item sizes, e.g. int64 and int32. By default, integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit).
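As a quick illustration of those defaults (the values here are arbitrary), you can confirm the dtype a column gets:

```python
import pandas as pd

# Integer literals default to int64 and float literals to float64,
# regardless of whether the platform is 32-bit or 64-bit.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.0, 3.0])

print(ints.dtype)    # int64
print(floats.dtype)  # float64
```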
Because Parquet is an open-source format, there are many different libraries and engines that can read and write the data. pandas lets you choose which engine is used to read a file, if you know which library works best for your case.
pandas itself cannot handle null/NA values in integer columns at the moment (version 0.23.x); a nullable integer type is planned for the next release. In the meantime, as soon as an integer column contains a null value, pandas automatically converts it into a float column, so you also get a float column in your resulting Parquet file:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [np.int64(5100), np.int64(5200), np.int64(5300)]
})
# df['a'].dtype == dtype('int64')

df = pd.DataFrame({
    'a': [None, np.int64(5200), np.int64(5200)]
})
# df['a'].dtype == dtype('float64')
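For reference, the nullable integer type mentioned above shipped in pandas 0.24 as the Int64 extension dtype (note the capital I). A sketch assuming pandas >= 0.24:

```python
import pandas as pd

# The nullable extension dtype 'Int64' (capital I) keeps the column
# integer-typed even when it contains missing values, instead of
# silently coercing it to float64.
s = pd.Series([None, 5200, 5200], dtype='Int64')

print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```

Whether the Parquet writer preserves this extension dtype on a round trip depends on the pyarrow version you pair it with, so check the resulting file's schema.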