
How to migrate pandas code to pandas arrow?

pandas 2.0 recently added support for Arrow data types, which seem to have several advantages over the standard NumPy-backed data types, both in speed and in support for missing values (NA).

I need to migrate a large code base to Arrow-backed pandas and I was wondering what kind of problems I may face. I didn't find any migration guide or similar.

I imagine I will need to change all the functions that load data so they use the Arrow back-end. Integer columns will now support NA, which is not a big deal, but my questions are:

  • Am I missing something, such as possible issues or incompatibilities?
  • Are common data types interoperable?
  • Is such a migration feasible?
  • Which other problems may I face?

I hope these questions are not too broad for SO.

Ziur Olpa asked Sep 15 '25 13:09



2 Answers

There is a short PyArrow user guide in the official pandas documentation.

Instantiation can be done using aliases for dtypes:

ser = pd.Series([-1.5, 0.2, None], dtype='float32[pyarrow]')

0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]

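Or, when reading data, by requesting the PyArrow dtype backend (engine='pyarrow' on its own only selects the parser and still returns NumPy-backed columns):

import io

data = io.StringIO('''a,b,c
1,2.5,True
3,4.5,False
''')

df = pd.read_csv(data, engine='pyarrow', dtype_backend='pyarrow')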

Operations are compatible between PyArrow and classical pandas objects:

ser.add(pd.Series([1, 2, 3]))

0    -0.5
1     2.2
2    <NA>
dtype: float[pyarrow]

One major difference from classical pandas objects is the support for nested structures (with NumPy-backed pandas, a Series of lists, for example, falls back to the object dtype):

import pyarrow as pa
import pandas as pd

pa_array = pa.array(
    [{'1': '2'}, {'10': '20'}, None],
    type=pa.map_(pa.string(), pa.string()),
)

ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))

0      [('1', '2')]
1    [('10', '20')]
2              <NA>
dtype: map<string, string>[pyarrow]

The expected advantage is improved performance on some operations (the documentation gives a few examples):

PyArrow data structure integration is implemented through pandas’ ExtensionArray interface; therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available.

This includes:

  • Numeric aggregations
  • Numeric arithmetic
  • Numeric rounding
  • Logical and comparison functions
  • String functionality
  • Datetime functionality
mozway answered Sep 18 '25 08:09



For large code bases, I found that migrating from pandas 1.x.x to pandas 2.x.x required major adjustments.

There were quite a few subtle changes in the pandas interface (compare the "Backwards incompatible API changes" section of the release notes), like this one:

Creating a new DataFrame using a full slice on both axes with loc or iloc (thus, df.loc[:, :] or df.iloc[:, :]) now returns a new DataFrame (shallow copy) instead of the original DataFrame, consistent with other methods to get a full slice (for example df.loc[:] or df[:]) (GH49469)

Chances are pretty high that existing unit tests will not capture all of those changes and that upgrading pandas will introduce bugs, such as a variable that now refers to a shallow copy of a DataFrame (pandas 2.x.x) instead of the DataFrame itself (pandas 1.x.x).
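The quoted change is easy to observe directly. The following sketch checks object identity for a full slice and prints the result together with the installed pandas version; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Full slice on both axes: the original object in pandas 1.x,
# a new DataFrame (shallow copy) in pandas 2.x.
full = df.loc[:, :]
returns_same_object = full is df

major_version = int(pd.__version__.split(".")[0])
print(f"pandas {pd.__version__}: df.loc[:, :] is df -> {returns_same_object}")
```

Code that relied on the identity, for example by adding a column to full and expecting it to show up in df, would behave differently after the upgrade without raising any error.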

As of August 2023, there are no migration guides or open source tools available addressing this topic.

I would recommend automatically running the code/unit tests with a debugger in parallel on pandas 1.x.x and pandas 2.x.x and comparing the local variables.
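A much simpler stand-in for the debugger-based comparison is to dump a summary of the interesting variables under each pandas version and diff the two dumps afterwards. The sketch below is not the tooling from the linked discussion; the helpers snapshot and diff_snapshots are hypothetical names invented for this example, and the summary (shape, dtypes, CSV text) is an assumption about what is worth comparing:

```python
import json

import pandas as pd


def snapshot(variables):
    """Summarise DataFrame/Series variables as plain, JSON-friendly data."""
    summary = {}
    for name, value in variables.items():
        if isinstance(value, (pd.DataFrame, pd.Series)):
            frame = value.to_frame() if isinstance(value, pd.Series) else value
            summary[name] = {
                "shape": list(frame.shape),
                "dtypes": [str(dtype) for dtype in frame.dtypes],
                "csv": frame.to_csv(),
            }
    return summary


def diff_snapshots(old, new):
    """Return the names of variables whose summaries differ between two runs."""
    names = old.keys() | new.keys()
    return sorted(name for name in names if old.get(name) != new.get(name))


# Hypothetical usage: in a real migration each snapshot would be written to a
# file with json.dump under one pandas version and compared afterwards.
before = snapshot({"df": pd.DataFrame({"a": [1, 2]})})
after = snapshot({"df": pd.DataFrame({"a": [1.0, 2.0]})})
changed = diff_snapshots(before, after)

print(json.dumps(before, indent=2))
print(changed)
```

Calling snapshot(locals()) at the end of a function under test is one way to capture its local variables without attaching a debugger.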

[Figure: Scheme to migrate code from pandas 1 to pandas 2]

This concept is taken from here (Migrate code from pandas 1 to pandas 2), where you can also find a more detailed discussion of the topic.

Markus Dutschke answered Sep 18 '25 09:09
