Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code:
import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))
we get the error:
pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object')
The same error occurs when calling
df.to_parquet('players.pq')
Is it possible for pyarrow to fall back to serializing these Python objects using pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using pq.write_table().
pandas.DataFrame.to_parquet() does not support a MultiIndex, so a solution using pq.write_table(pa.Table.from_pandas(df)) is preferred. Thank you!
My suggestion would be to insert the data into the DataFrame already serialized. There are two ways to do that.
One option is to define the Player class as a dataclass using the @dataclass decorator, so that dataclasses.asdict() can turn each instance into a plain dict that json.dumps() serializes without any custom code.
import json
import pandas as pd
from dataclasses import asdict, dataclass

@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

dataV2 = [
    PlayerV2(name='Jack', age=21, gender='m'),
    PlayerV2(name='Ryan', age=18, gender='m'),
    PlayerV2(name='Jane', age=35, gender='f'),
]

# Serialize each dataclass instance to a JSON string before building the DataFrame
df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
print(df_v2)

# You can still get an object's attributes by deserializing the record
print(json.loads(df_v2["player"][0])['name'])
Alternatively, define a serialization function in the Player class and serialize each instance before creating the DataFrame.
import pandas as pd
import json

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

    # The serialization function for JSON; if for some reason you really need pickle, you can use it instead
    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__)

# Serialize the objects before inserting them into the DataFrame
data = [
    Player('Jack', 21, 'm').toJSON(),
    Player('Ryan', 18, 'm').toJSON(),
    Player('Jane', 35, 'f').toJSON(),
]
df = pd.DataFrame(data, columns=['player'])

# You can see all the data inserted as serialized JSON in the column player
print(df)

# You can still get an object's attributes by deserializing the record
print(json.loads(df["player"][0])['name'])