Is there a way for me to generate a pyarrow schema in this format from a pandas DF? I have some files which have hundreds of columns so I can't type it out manually.
fields = [
pa.field('id', pa.int64()),
pa.field('date', pa.timestamp('ns')),
pa.field('name', pa.string()),
pa.field('status', pa.dictionary(pa.int8(), pa.string(), ordered=False)),
]
I'd like to save it in a file and then refer to it explicitly when I save data with to_parquet.
I tried schema = pa.Schema.from_pandas(df), but when I print the schema it comes out in a different format; I can't save it as a list of fields like the example above.
Ideally, I would take a pandas dtype dictionary and then remap it into the fields list above. Is that possible?
schema = {
'id': 'int64',
'date': 'datetime64[ns]',
'name': 'object',
'status': 'category',
}
Otherwise, I will make the dtype schema, print it out and paste it into a file, make any required corrections, and then do a df = df.astype(schema) before saving the file to Parquet. However, I know I can run into issues with fully null columns in a partition or object columns with mixed data types.
I don't understand why pa.Schema.from_pandas(df) doesn't work for you.
As far as I understand, what you need is this:
schema = pa.Schema.from_pandas(df)
fields = []
for col_name, col_type in zip(schema.names, schema.types):
    fields.append(pa.field(col_name, col_type))
or using list comprehension:
schema = pa.Schema.from_pandas(df)
fields = [pa.field(col_name, col_type) for col_name, col_type in zip(schema.names, schema.types)]