Recently pandas 2.0 added support for Arrow data types, which seem to have many advantages over the standard NumPy-backed dtypes, both in speed and in missing-value (NA) support.
I need to migrate a large code base to the pandas Arrow backend, and I was wondering what kinds of problems I may face; I didn't find any migration guide or similar.
I imagine I will need to change all the functions that load data so they use the Arrow backend. Integer columns will now support NA, which is not a big deal, but my questions are:
I hope these questions are not too broad for SO.
There is a short PyArrow user guide in the official pandas documentation.
Instantiation can be done using string aliases for the dtypes:
import pandas as pd

# note: PyArrow names float32 simply "float", hence the displayed dtype
ser = pd.Series([-1.5, 0.2, None], dtype='float32[pyarrow]')
print(ser)
0   -1.5
1    0.2
2   <NA>
dtype: float[pyarrow]
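The string alias is shorthand for the explicit pd.ArrowDtype constructor, which accepts any PyArrow type; an equivalent form:
import pyarrow as pa
import pandas as pd

# equivalent to dtype='float32[pyarrow]'
ser = pd.Series([-1.5, 0.2, None], dtype=pd.ArrowDtype(pa.float32()))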
Or by specifying the PyArrow engine and dtype backend when reading data (engine='pyarrow' alone only uses PyArrow for parsing; dtype_backend='pyarrow' is what produces Arrow-backed columns):
import io
import pandas as pd

data = io.StringIO('''a,b,c
1,2.5,True
3,4.5,False
''')
df = pd.read_csv(data, engine='pyarrow', dtype_backend='pyarrow')
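For a large code base you may not want to rewrite every loader at once; existing NumPy-backed frames can also be converted after the fact. A minimal sketch using convert_dtypes (the sample frame is just illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2.5, None]})     # classical NumPy-backed frame
df_arrow = df.convert_dtypes(dtype_backend='pyarrow')  # Arrow-backed copy
print(df_arrow.dtypes)                                 # int64[pyarrow], double[pyarrow]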
Operations are compatible between PyArrow-backed and classical NumPy-backed pandas objects:
print(ser.add(pd.Series([1, 2, 3])))
0   -0.5
1    2.2
2   <NA>
dtype: float[pyarrow]
One major difference from classical pandas objects is the support for nested structures (which, with NumPy-backed pandas, would fall back to the object dtype, e.g. a Series of lists):
import pyarrow as pa
import pandas as pd

pa_array = pa.array(
    [{'1': '2'}, {'10': '20'}, None],
    type=pa.map_(pa.string(), pa.string()),
)
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
print(ser)
0     [('1', '2')]
1    [('10', '20')]
2             <NA>
dtype: map<string, string>[pyarrow]
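List-valued columns work the same way; a short sketch declaring the nested type directly via pd.ArrowDtype:
import pyarrow as pa
import pandas as pd

ser = pd.Series([[1, 2], [3], None], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
print(ser.dtype)   # list<item: int64>[pyarrow]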
The expected advantages are improved performance on some operations (a few examples are given in the documentation):
PyArrow data structure integration is implemented through pandas’ ExtensionArray interface; therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality is accelerated with PyArrow compute functions where available.
This includes:
- Numeric aggregations
- Numeric arithmetic
- Numeric rounding
- Logical and comparison functions
- String functionality
- Datetime functionality
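For example, string methods on an Arrow-backed Series dispatch to PyArrow compute functions where available; a minimal illustration:
import pyarrow as pa
import pandas as pd

ser = pd.Series(['spam', 'ham', None], dtype=pd.ArrowDtype(pa.string()))
print(ser.str.upper())   # result stays Arrow-backed, the missing value stays <NA>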
There were a number of quite subtle changes to the pandas interface (see the "Backwards incompatible API changes" section of the release notes), like this one:
Creating a new DataFrame using a full slice on both axes with loc or iloc (thus, df.loc[:, :] or df.iloc[:, :]) now returns a new DataFrame (shallow copy) instead of the original DataFrame, consistent with other methods to get a full slice (for example df.loc[:] or df[:]) (GH49469)
Chances are pretty high that existing unit tests will not capture all of those changes, and upgrading pandas will introduce bugs, e.g. a variable now referring to a shallow copy of a DataFrame (pandas 2.x) instead of the original DataFrame itself (pandas 1.x).
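A minimal repro of the quoted change (the identity check follows directly from the changelog entry above):
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
sliced = df.loc[:, :]
print(sliced is df)   # pandas 1.x: True (the original object); pandas 2.x: False (shallow copy)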
I would recommend automatically running the code/unit tests with a debugger in parallel on pandas 1.x and pandas 2.x and comparing the local variables.
This concept is taken from here (Migrate code from pandas 1 to pandas 2), where you can also find a more detailed discussion of the topic.
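As a rough sketch of that idea, assuming the interesting local variables are dumped to disk in both environments (paths and file names here are hypothetical):
import pandas as pd

# Hypothetical dumps written by the same test run under each version,
# e.g. df.to_parquet('run_pandas1/step3.parquet') in the pandas 1.x env.
old = pd.read_parquet('run_pandas1/step3.parquet')
new = pd.read_parquet('run_pandas2/step3.parquet')

# Raises an AssertionError describing the first mismatch, if any.
pd.testing.assert_frame_equal(old, new)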