
Generate a pyarrow schema in the format of a list of pa.fields?

Is there a way to generate a pyarrow schema in this format from a pandas DataFrame? Some of my files have hundreds of columns, so I can't type it out manually.

fields = [
    pa.field('id', pa.int64()),
    pa.field('date', pa.timestamp('ns')), 
    pa.field('name', pa.string()), 
    pa.field('status', pa.dictionary(pa.int8(), pa.string(), ordered=False)),
]

I'd like to save it in a file and then refer to it explicitly when I save data with to_parquet.

I tried schema = pa.Schema.from_pandas(df), but when I print schema it comes out in a different format, so I can't save it as a list of pa.field entries like the fields example above.

Ideally, I would take a pandas dtype dictionary and then remap it into the fields list above. Is that possible?

schema = {
  'id': 'int64',
  'date': 'datetime64[ns]', 
  'name': 'object', 
  'status': 'category',
}

Otherwise, I will make the dtype schema, print it out and paste it into a file, make any required corrections, and then do a df = df.astype(schema) before saving the file to Parquet. However, I know I can run into issues with fully null columns in a partition or object columns with mixed data types.

trench asked Jan 26 '26 13:01


1 Answer

I really don't understand why pa.Schema.from_pandas(df) doesn't work for you.

As far as I understand, what you need is this:

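import pyarrow as pa

# Infer the schema from the DataFrame, then rebuild it as a list of pa.field objects.
schema = pa.Schema.from_pandas(df)
fields = []
for col_name, col_type in zip(schema.names, schema.types):
    fields.append(pa.field(col_name, col_type))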

or using list comprehension:

schema = pa.Schema.from_pandas(df)
fields = [pa.field(col_name, col_type) for col_name, col_type in zip(schema.names, schema.types)]
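
If the goal is a text file of pa.field(...) lines built from a pandas dtype dictionary like the one in the question, one option is to generate the source text directly. The following is a minimal sketch, not a definitive implementation: the names PANDAS_TO_ARROW_SOURCE and fields_source are hypothetical, and the mapping itself is an assumption (for example, 'object' is treated as a string column and 'category' as an int8-indexed string dictionary), so it should be adjusted to the actual data.

```python
# Assumed mapping from pandas dtype strings to pyarrow type expressions.
# These choices are illustrative only; e.g. 'object' columns may not hold strings.
PANDAS_TO_ARROW_SOURCE = {
    "int64": "pa.int64()",
    "float64": "pa.float64()",
    "bool": "pa.bool_()",
    "datetime64[ns]": "pa.timestamp('ns')",
    "object": "pa.string()",
    "category": "pa.dictionary(pa.int8(), pa.string(), ordered=False)",
}


def fields_source(dtypes):
    """Return Python source text for a `fields` list built from a dtype dict."""
    lines = ["fields = ["]
    for name, dtype in dtypes.items():
        key = str(dtype)
        if key not in PANDAS_TO_ARROW_SOURCE:
            # Fail loudly instead of guessing a type for an unmapped dtype.
            raise ValueError(f"no mapping for column {name!r} with dtype {key!r}")
        lines.append(f"    pa.field({name!r}, {PANDAS_TO_ARROW_SOURCE[key]}),")
    lines.append("]")
    return "\n".join(lines)


# The dtype dictionary from the question.
dtypes = {
    "id": "int64",
    "date": "datetime64[ns]",
    "name": "object",
    "status": "category",
}
source = fields_source(dtypes)
print(source)  # paste the output into a .py file and correct it by hand
```

For an existing DataFrame, df.dtypes.astype(str).to_dict() produces a dictionary of this shape. Any dtype missing from the mapping raises a ValueError, which makes unexpected columns visible before the schema file is written.
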
Vardan Grigoryants answered Jan 29 '26 04:01