
Error when trying to write DataFrame to feather. Does feather support list columns?

I'm working with both R and Python and I want to write one of my pandas DataFrames as a feather so I can work with it more easily in R. However, when I try to write it as a feather, I get the following error:

ArrowInvalid: trying to convert NumPy type float64 but got float32

I double-checked my column types and they are already float64:

In[1]
df.dtypes

Out[1]
id         Object
cluster    int64
vector_x   float64
vector_y   float64

I get the same error regardless of using feather.write_dataframe(df, "path/df.feather") or df.to_feather("path/df.feather").

I saw this on GitHub but didn't understand if it was related or not: https://issues.apache.org/jira/browse/ARROW-1345 and https://github.com/apache/arrow/issues/1430

In the end, I can just save it as a csv and change the columns in R (or just do the whole analysis in Python), but I was hoping to use this.

Edit 1:

I'm still having the same issue despite the great advice below, so I'm updating with what I've tried.

df[['vector_x', 'vector_y', 'cluster']] = df[['vector_x', 'vector_y', 'cluster']].astype(float)

df[['doc_id', 'text']] = df[['doc_id', 'text']].astype(str)

df[['doc_vector', 'doc_vectors_2d']] = df[['doc_vector', 'doc_vectors_2d']].astype(list)

df.dtypes

Out[1]:
doc_id           object
text             object
doc_vector       object
cluster          float64
doc_vectors_2d   object
vector_x         float64
vector_y         float64
dtype: object

Edit 2:

After much searching, it appears that the issue is that my cluster column is a list type made up of int64 integers. So I guess the real question is: does the feather format support lists?

Edit 3:

Just to tie this in a bow, feather does not support nested data types like lists, at least not yet.

Ben G asked Jan 24 '19 20:01


3 Answers

The problem in your case is the id column, which has object dtype. Such a column holds arbitrary Python objects, and these cannot be represented in a language-neutral format. Feather (actually the underlying Apache Arrow / pyarrow) therefore tries to guess the data type of the id column. The guess is based on the first objects it sees in the column, which here are float64 NumPy scalars. Later in the column there are float32 scalars. Instead of coercing them to a common type, Arrow is strict about types and fails.

You should be able to work around this problem by ensuring that all columns have a non-object dtype with df['id'] = df['id'].astype(float).
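As a rough sketch of the check and cast described above, the snippet below lists object columns that mix several Python types and then casts the offending column to float. The helper name mixed_type_columns and the sample frame are made up for illustration; they are not part of pandas or of the original question's data.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to the set of type names it holds,
    keeping only columns that contain more than one type."""
    report = {}
    for name in df.select_dtypes(include=["object"]).columns:
        type_names = {type(value).__name__ for value in df[name]}
        if len(type_names) > 1:
            report[name] = type_names
    return report


# Hypothetical frame: 'id' mixes float64 and float32 NumPy scalars.
df = pd.DataFrame({
    "id": pd.Series([np.float64(1.0), np.float32(2.0)], dtype=object),
    "cluster": [0, 1],
})

report = mixed_type_columns(df)
print(report)

# Cast to a single non-object dtype, as suggested in the answer.
df["id"] = df["id"].astype(float)
print(df.dtypes)
```

After the cast, the column no longer has object dtype, so the helper should report nothing for it.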

Uwe L. Korn answered Oct 26 '22 03:10


After much research, the simple answer is that feather does not support list (or other nested data type) columns.
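If upgrading is not an option, one possible workaround (not taken from the answers here, just a sketch) is to store each list as a JSON string so the column becomes a plain string column, and decode it again after reading. The helper names encode_list_columns and decode_list_columns and the sample data are hypothetical, and the sketch assumes the lists contain values that the json module can serialize.

```python
import json

import pandas as pd


def encode_list_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy in which each listed column holds JSON strings instead of lists."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].map(lambda value: json.dumps(list(value)))
    return out


def decode_list_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy in which each listed column is parsed from JSON back into lists."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].map(json.loads)
    return out


# Hypothetical frame with one list-valued column.
df = pd.DataFrame({
    "doc_id": ["a", "b"],
    "doc_vector": [[0.1, 0.2], [0.3, 0.4]],
})

encoded = encode_list_columns(df, ["doc_vector"])   # safe to write as strings
decoded = decode_list_columns(encoded, ["doc_vector"])
print(encoded["doc_vector"].tolist())
print(decoded["doc_vector"].tolist())
```

The encoded frame contains only scalar columns, at the cost of having to parse the strings again on the reading side.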

Ben G answered Oct 26 '22 03:10


  • Luckily, I found the reason for my feather IO error here.
  • I also found a solution: DataFrame.to_feather and pandas.read_feather are both based on pyarrow, and pyarrow has supported columns that contain lists as values since 2019.

Solution:

pip install --upgrade pyarrow  # my version is 1.0.0 and it works

Then keep using df.to_feather("Filename") and pd.read_feather("Filename") as before.

Ajay Liu answered Oct 26 '22 02:10