
Assign schema to pa.Table.from_pandas()

I'm getting this error when converting a pandas DataFrame to Parquet using PyArrow:

ArrowInvalid('Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

To find out which column is the problem, I built a new DataFrame in a for loop, starting with the first column and adding one more column on each iteration. The error comes from a column of dtype object whose first values are 0s. I guess that is why PyArrow tries to convert the column to int, and it then fails because other values are UUIDs.

I'm trying to pass a schema (not sure if this is the way to go):

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

where schema is: df.dtypes

Carlos P Ceballos asked Mar 29 '18 22:03


1 Answer

Carlos, have you tried converting the column to one of the pandas types listed here: https://arrow.apache.org/docs/python/pandas.html?

Can you post the output of df.dtypes?

If changing the pandas column type doesn't help, you can define a pyarrow schema to pass in.

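import pyarrow as pa

# Declare each column's Arrow type explicitly instead of relying on inference
fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')),
]

my_schema = pa.schema(fields)

# sample_df is the pandas DataFrame to convert
table = pa.Table.from_pandas(sample_df, schema=my_schema, preserve_index=False)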

More information here:

https://arrow.apache.org/docs/python/data.html
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_pandas
https://arrow.apache.org/docs/python/generated/pyarrow.schema.html

Alexander answered Oct 17 '22 17:10