How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a single dask.dataframe.

Trying to cast the pandas column using df.column_name = df.column_name.astype(sometype) didn't work.
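One workaround (not from the original question; a minimal sketch assuming the default pyarrow engine) is to build the Arrow table yourself with an explicit schema instead of letting pyarrow infer one from the data:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

b = pd.DataFrame([None, None], columns=('value',))

# Declare the column type explicitly so an all-null column is not
# inferred as Arrow's "null" type.
schema = pa.schema([('value', pa.string())])

table = pa.Table.from_pandas(b, schema=schema, preserve_index=False)
pq.write_table(table, 'b.parquet')

Whether the resulting pandas metadata then matches the files written from non-null frames is worth verifying against your own data.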

Why I'm asking this

I want to load many parquet files into a single dask.dataframe. Each file was generated from a pd.DataFrame using df.to_parquet(filename). All dataframes have the same columns, but in some of them a given column contains only null values. When I try to load all files into the dask.dataframe (using df = dd.read_parquet('*.parquet')), I get the following error:

Schema in filename.parquet was different.
id: int64
text: string
[...]
some_column: double

vs

id: int64
text: string
[...]
some_column: null

Steps to reproduce my problem

import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame(['1', '1'], columns=('value',))
b = pd.DataFrame([None, None], columns=('value',))
a.to_parquet('a.parquet')
b.to_parquet('b.parquet')
df = dd.read_parquet('*.parquet')  # Reads a and b

This gives me the following:

ValueError: Schema in path/to/b.parquet was different. 
value: null
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "value", "field_name": "value", "pandas_type": "empty'
            b'", "numpy_type": "object", "metadata": null}, {"name": null, "fi'
            b'eld_name": "__index_level_0__", "pandas_type": "int64", "numpy_t'
            b'ype": "int64", "metadata": null}], "pandas_version": "0.22.0"}'}

vs

value: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "value", "field_name": "value", "pandas_type": "unico'
            b'de", "numpy_type": "object", "metadata": null}, {"name": null, "'
            b'field_name": "__index_level_0__", "pandas_type": "int64", "numpy'
            b'_type": "int64", "metadata": null}], "pandas_version": "0.22.0"}'}

Notice how in one case we have "pandas_type": "unicode" and in the other we have "pandas_type": "empty".
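You can confirm the schema mismatch directly with pyarrow's schema reader, without digging through the error message (this snippet assumes the two files from the reproduction above):

import pyarrow.parquet as pq

print(pq.read_schema('a.parquet'))  # shows value: string
print(pq.read_schema('b.parquet'))  # shows value: null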

Related questions that didn't provide me with a solution

  • How to specify logical types when writing Parquet files from PyArrow?
asked May 01 '18 by HugoMailhot
1 Answer

If you instead use fastparquet, you can achieve what you want:

import pandas as pd
import dask.dataframe as dd
a = pd.DataFrame(['1', '1'], columns=('value',))
b = pd.DataFrame([None, None], columns=('value',))
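# object_encoding is a fastparquet option (forwarded by pandas' to_parquet)
# that fixes the on-disk type of object columns, so the all-null frame is
# written with the same schema as the non-null one.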
a.to_parquet('a.parquet', object_encoding='int', engine='fastparquet')
b.to_parquet('b.parquet', object_encoding='int', engine='fastparquet')

dd.read_parquet('*.parquet').compute()

gives

   value
0    1.0
1    1.0
0    NaN
1    NaN
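
If both pyarrow and fastparquet are installed, dask may not pick fastparquet by default; a possible addition (not part of the original answer) is to pass the engine explicitly when reading:

dd.read_parquet('*.parquet', engine='fastparquet').compute()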
answered Oct 09 '22 by mdurant