How to save a pandas DataFrame with custom types using pyarrow and parquet

I want to save a pandas DataFrame to parquet, but I have some unsupported types in it (for example bson ObjectIds).

Throughout the examples we use:

import pandas as pd
import pyarrow as pa
from bson import ObjectId  # ObjectId comes from pymongo's bson package

Here's a minimal example to show the situation:

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': ObjectId('5e9992543bfddb58073803e7')},
        {'name': 'bob',   'oid': ObjectId('5e9992543bfddb58073803e8')},
    ]
)

df.to_parquet('some_path')

And we get:

ArrowInvalid: ('Could not convert 5e9992543bfddb58073803e7 with type ObjectId: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column oid with type object')

I tried to follow this reference: https://arrow.apache.org/docs/python/extending_types.html

So I wrote the following extension type:

class ObjectIdType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(12), "my_package.objectid")

    def __arrow_ext_serialize__(self):
        # since we don't have a parametrized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return ObjectIdType()
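
Following the same docs, the type also needs to be registered globally so that __arrow_ext_deserialize__ can actually be used when data comes back in (a minimal sketch):

# Register the extension type under its name "my_package.objectid" so that
# pyarrow can reconstruct ObjectIdType when it sees that metadata on read.
objectid_type = ObjectIdType()
pa.register_extension_type(objectid_type)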

I was then able to get a working pyarrow array for my oid column:

values = df['oid']
storage_array = pa.array(values.map(lambda oid: oid.binary), type=pa.binary(12))
pa.ExtensionArray.from_storage(objectid_type, storage_array)
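
From there, one way to round-trip a single dataframe is to wrap the extension array into a table and use pyarrow.parquet directly; a sketch, assuming the registration above and a reasonably recent pyarrow (older versions may drop the extension metadata when writing parquet):

import pyarrow.parquet as pq

oid_array = pa.ExtensionArray.from_storage(objectid_type, storage_array)
table = pa.table({'name': pa.array(df['name']), 'oid': oid_array})

pq.write_table(table, 'objectids.parquet')
restored = pq.read_table('objectids.parquet')
# With the type registered, restored.schema.field('oid').type is ObjectIdType;
# otherwise it comes back as the storage type fixed_size_binary(12).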

Now where I'm stuck, and cannot find any good solution on the internet, is how to save my df to parquet while letting it know which column needs which extension type. I might change columns in the future, and I have several different types that need this treatment.

How can I simply create parquet files from dataframes and restore them while transparently converting the types?

I tried to create a pyarrow.Table object and append columns to it after preprocessing, but it doesn't work as written: table.append_column expects a field name together with the array (and returns a new table rather than mutating in place), plus the whole isinstance thing looks like a terrible solution.

table = pa.Table.from_pandas(pd.DataFrame())
for col, values in df.items():

    if isinstance(values.iloc[0], ObjectId):
        arr = pa.array(
            values.map(lambda oid: oid.binary), type=pa.binary(12)
        )

    elif isinstance(values.iloc[0], ...):
        ...

    else:
        arr = pa.array(values)

    table.append_column(arr, col)  # FAILS (wrong type) -- see the corrected call below
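
For what it's worth, the call itself works the other way round: append_column takes the field name first and the array second, returns a new table, and requires the starting table to already have a matching row count, so building from an empty DataFrame won't work. A hypothetical corrected call:

# Start from a table that already has the right number of rows,
# then append the extension column under its name.
base = pa.table({'name': pa.array(df['name'])})
oid_storage = pa.array(df['oid'].map(lambda oid: oid.binary), type=pa.binary(12))
table = base.append_column('oid', pa.ExtensionArray.from_storage(objectid_type, oid_storage))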

Pseudocode of the ideal solution:

parquetize(df, path, my_custom_types_conversions)
# ...
new_df = unparquetize(path, my_custom_types_conversions)

assert df.equals(new_df)  # types have been correctly restored
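
One hypothetical shape for these helpers, assuming a conversions mapping from column name to an extension type plus to/from-storage callables (all names made up for illustration, and write support for extension types depends on the pyarrow version):

import pyarrow.parquet as pq

# Hypothetical conversion table: column name -> (extension type, to-storage fn, from-storage fn)
my_custom_types_conversions = {
    'oid': (ObjectIdType(), lambda oid: oid.binary, ObjectId),
}

def parquetize(df, path, conversions):
    arrays = {}
    for col in df.columns:
        if col in conversions:
            ext_type, to_storage, _ = conversions[col]
            storage = pa.array(df[col].map(to_storage), type=ext_type.storage_type)
            arrays[col] = pa.ExtensionArray.from_storage(ext_type, storage)
        else:
            arrays[col] = pa.array(df[col])
    pq.write_table(pa.table(arrays), path)

def unparquetize(path, conversions):
    df = pq.read_table(path).to_pandas()
    # Extension columns come back to pandas as their storage values (bytes here),
    # so map them through the from-storage callable to restore the original objects.
    for col, (_, _, from_storage) in conversions.items():
        if col in df.columns:
            df[col] = df[col].map(from_storage)
    return df

With such a mapping, the assert above would be expected to hold as long as every custom column has an entry in it.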

I'm getting lost in pyarrow's docs on whether I should use ExtensionType, serialization, or something else to write these functions. Any pointer would be appreciated.

Side note: I do not need parquet by all means; the main issue is being able to save and restore dataframes with custom types quickly and space-efficiently. I tried a solution based on jsonifying and gzipping the dataframe, but it was too slow.

asked Apr 17 '20 by Silver Duck


1 Answer

I think it is probably because pyarrow does not know how to convert an ObjectId value to an Arrow type, hence the exception during type conversion.

I tried the example you provided, casting the oid values to strings during dataframe creation, and it worked.

Here are the steps:

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': "ObjectId('5e9992543bfddb58073803e7')"},
        {'name': 'bob',   'oid': "ObjectId('5e9992543bfddb58073803e8')"},
    ]
)

df.to_parquet('parquet_file.parquet')
df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1

output:

    name    oid
0   alice   ObjectId('5e9992543bfddb58073803e7')
1   bob     ObjectId('5e9992543bfddb58073803e8')
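
If actual ObjectId instances are needed after reading, one variant of this workaround (a hypothetical sketch) is to store the plain 24-character hex string and rebuild the ObjectIds on read:

df = pd.DataFrame(
    [
        {'name': 'alice', 'oid': str(ObjectId('5e9992543bfddb58073803e7'))},
        {'name': 'bob',   'oid': str(ObjectId('5e9992543bfddb58073803e8'))},
    ]
)
df.to_parquet('parquet_file.parquet')

df1 = pd.read_parquet('parquet_file.parquet', engine='pyarrow')
df1['oid'] = df1['oid'].map(ObjectId)  # 24-char hex string -> ObjectId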

answered Sep 19 '22 by aninda