Recently pandas 2.0 added support for Arrow data types, which seem to have many advantages over the standard NumPy-backed dtypes, both in speed and in missing-value (NA) support.
I need to migrate a large code base to the pandas Arrow backend, and I was wondering what kinds of problems I may face; I didn't find any migration guide or similar.
I imagine I will need to change all the functions that load data so they use the Arrow backend. Integer columns will now support NA, which is not a big deal, but my questions are:
I hope these questions are not too broad for SO.
There is a short PyArrow user guide in the official pandas documentation.
Instantiation can be done using string aliases for the dtypes:
import pandas as pd

# note: PyArrow names float32 simply "float", hence the displayed dtype
ser = pd.Series([-1.5, 0.2, None], dtype='float32[pyarrow]')
print(ser)
0   -1.5
1    0.2
2   <NA>
dtype: float[pyarrow]
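The string alias is shorthand for the explicit pd.ArrowDtype constructor, which accepts any PyArrow type; an equivalent form:
import pyarrow as pa
import pandas as pd

# equivalent to dtype='float32[pyarrow]'
ser = pd.Series([-1.5, 0.2, None], dtype=pd.ArrowDtype(pa.float32()))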
Or by specifying the PyArrow engine and dtype backend when reading data (engine='pyarrow' alone only uses PyArrow for parsing; dtype_backend='pyarrow' is what produces Arrow-backed columns):
import io
import pandas as pd

data = io.StringIO('''a,b,c
1,2.5,True
3,4.5,False
''')
df = pd.read_csv(data, engine='pyarrow', dtype_backend='pyarrow')
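For a large code base you may not want to rewrite every loader at once; existing NumPy-backed frames can also be converted after the fact. A minimal sketch using convert_dtypes (the sample frame is just illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2.5, None]})     # classical NumPy-backed frame
df_arrow = df.convert_dtypes(dtype_backend='pyarrow')  # Arrow-backed copy
print(df_arrow.dtypes)                                 # int64[pyarrow], double[pyarrow]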
Operations are compatible between PyArrow-backed and classical NumPy-backed pandas objects:
print(ser.add(pd.Series([1, 2, 3])))
0   -0.5
1    2.2
2   <NA>
dtype: float[pyarrow]
One major difference from classical pandas objects is the support for nested structures (which, with NumPy-backed pandas, would fall back to the object dtype, e.g. a Series of lists):
import pyarrow as pa
import pandas as pd

pa_array = pa.array(
    [{'1': '2'}, {'10': '20'}, None],
    type=pa.map_(pa.string(), pa.string()),
)
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
print(ser)
0     [('1', '2')]
1    [('10', '20')]
2             <NA>
dtype: map<string, string>[pyarrow]
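List-valued columns work the same way; a short sketch declaring the nested type directly via pd.ArrowDtype:
import pyarrow as pa
import pandas as pd

ser = pd.Series([[1, 2], [3], None], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
print(ser.dtype)   # list<item: int64>[pyarrow]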
The expected advantages are improved performance on some operations (a few examples are given in the documentation):
PyArrow data structure integration is implemented through pandas’ ExtensionArray interface; therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available.
This includes:
- Numeric aggregations
- Numeric arithmetic
- Numeric rounding
- Logical and comparison functions
- String functionality
- Datetime functionality
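For example, string methods on an Arrow-backed Series dispatch to PyArrow compute functions where available; a minimal illustration:
import pyarrow as pa
import pandas as pd

ser = pd.Series(['spam', 'ham', None], dtype=pd.ArrowDtype(pa.string()))
print(ser.str.upper())   # result stays Arrow-backed, the missing value stays <NA>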
There were a number of quite subtle changes to the pandas interface (see the "Backwards incompatible API changes" section of the release notes), like this one:
Creating a new DataFrame using a full slice on both axes with loc or iloc (thus, df.loc[:, :] or df.iloc[:, :]) now returns a new DataFrame (shallow copy) instead of the original DataFrame, consistent with other methods to get a full slice (for example df.loc[:] or df[:]) (GH49469)
Chances are pretty high that existing unit tests will not capture all of those changes, and upgrading pandas will introduce bugs, e.g. a variable now referring to a shallow copy of a DataFrame (pandas 2.x) instead of the original DataFrame itself (pandas 1.x).
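A minimal repro of the quoted change (the identity check follows directly from the changelog entry above):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
sliced = df.loc[:, :]
print(sliced is df)   # pandas 1.x: True (the original object); pandas 2.x: False (shallow copy)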
I would recommend automatically running the code/unit tests with a debugger in parallel on pandas 1.x and pandas 2.x and comparing the local variables.
This concept is taken from here (Migrate code from pandas 1 to pandas 2), where you can also find a more detailed discussion of the topic.
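As a rough sketch of that idea, assuming the interesting local variables are dumped to disk in both environments (paths and file names here are hypothetical):
import pandas as pd

# Hypothetical dumps written by the same test run under each version,
# e.g. df.to_parquet('run_pandas1/step3.parquet') in the pandas 1.x env.
old = pd.read_parquet('run_pandas1/step3.parquet')
new = pd.read_parquet('run_pandas2/step3.parquet')

# Raises an AssertionError describing the first mismatch, if any.
pd.testing.assert_frame_equal(old, new)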