
pyarrow.lib.ArrowInvalid: ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type')

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))

we get the error:

pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object')

The same error occurs when using:

df.to_parquet('players.pq')

Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using pyarrow.parquet.write_table().

  • Using Python 3.8.0, pandas 0.25.3, pyarrow 0.13.0.
  • pandas.DataFrame.to_parquet() does not support a MultiIndex, so a solution using pq.write_table(pa.Table.from_pandas(df)) is preferred.

Thank you!

Nyxynyx asked Jan 07 '20 22:01



1 Answer

My suggestion is to serialize the objects before inserting them into the DataFrame.

Best option - Use dataclass (Python >= 3.7)

Define the Player class as a dataclass using the decorator. dataclasses.asdict() then turns each instance into a plain dict, which json.dumps() can serialize to JSON without a hand-written serialization method.

import json
import pandas as pd
from dataclasses import asdict, dataclass

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'


dataV2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# Serialize each instance to a JSON string via asdict() before building the DataFrame
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
print(df_v2)

# Can still get the object's attributes by deserializing the record
json.loads(df_v2["player"][0])['name']

Manually serialize the object (Python < 3.7)

Define a serialization function in the Player class and serialize each instance before creating the DataFrame.

import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'
    
    # Serialization function for JSON; if you really need pickle, you can use it here instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]
df = pd.DataFrame(data, columns=['player'])

# All the data is stored in the player column as serialized JSON
print(df)

# Can still get the object's attributes by deserializing the record
json.loads(df["player"][0])['name']
Nimrod Carmel answered Sep 18 '22 00:09