How to speed up pandas with cython (or numpy)

I am trying to use Cython to speed up a relatively simple pandas DataFrame computation: iterate over each row of the DataFrame, add that row to itself and to every remaining row, sum each of those pairwise row sums across the columns, and yield the list of sums. The lists get shorter as the rows are exhausted, and they are stored in a dictionary keyed on the row's index number.

def foo(df):
    vals = {i: (df.iloc[i, :] + df.iloc[i:, :]).sum(axis=1).values.tolist()
            for i in range(df.shape[0])}
    return vals

Aside from adding %%cython at the top of this function, does anyone have a recommendation on how I'd go about using cdefs to convert the DataFrame values to doubles and then cythonize this code?

Below is some dummy data:

>>> df
          A         B         C         D         E
0 -0.326403  1.173797  1.667856 -1.087655  0.427145
1 -0.797344  0.004362  1.499460  0.427453 -0.184672
2 -1.764609  1.949906 -0.968558  0.407954  0.533869
3  0.944205  0.158495 -1.049090 -0.897253  1.236081
4 -2.086274  0.112697  0.934638 -1.337545  0.248608
5 -0.356551 -1.275442  0.701503  1.073797 -0.008074
6 -1.300254  1.474991  0.206862 -0.859361  0.115754
7 -1.078605  0.157739  0.810672  0.468333 -0.851664
8  0.900971  0.021618  0.173563 -0.562580 -2.087487
9  2.155471 -0.605067  0.091478  0.242371  0.290887

and expected output:

>>> foo(df)
{0: [3.7094795101205236,
     2.8039983729106,
     2.013301815968468,
     2.24717712931852,
     -0.27313665495940964,
     1.9899718844711711,
     1.4927321304935717,
     1.3612155622947018,
     0.3008239883773878,
     4.029880107986906],
 . . .
 6: [-0.72401524913338,
     -0.8555318173322499,
     -1.9159233912495635,
     1.813132728359954],
 7: [-0.9870483855311194, -2.047439959448434, 1.6816161601610844],
 8: [-3.107831533365748, 0.6212245862437702],
 9: [4.350280705853288]}
asked May 15 '15 by Alexander

People also ask

Does Cython speed up pandas?

Cython (writing C extensions for pandas): for many use cases, writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applications, however, it is possible to achieve sizable speed-ups by offloading work to Cython.

Which is faster, pandas or NumPy?

NumPy is more memory efficient. Pandas tends to perform better when the number of rows is 500K or more, while NumPy performs better when the number of rows is 50K or less. Indexing a pandas Series is very slow compared to indexing a NumPy array.
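To illustrate that last point, here is a minimal IPython sketch (the array size and index are made up for illustration) comparing a scalar lookup through the pandas indexing machinery with direct NumPy element access:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1_000_000))
a = s.to_numpy()  # the underlying NumPy array

%timeit s[500000]   # scalar lookup via pandas indexing
%timeit a[500000]   # direct NumPy element access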

Is apply faster than itertuples?

As mentioned previously, this is because apply is optimized to loop through DataFrame rows much more quickly than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if looping is required, try itertuples instead. Using map as a vectorized solution gives even faster results.
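To make those options concrete, here is a small sketch (the DataFrame and column names are made up for illustration) showing each idiom computing the same per-row sum:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=["a", "b", "c"])

# iterrows: slowest, yields a full Series for each row
sums_iterrows = [row["a"] + row["b"] + row["c"] for _, row in df.iterrows()]

# itertuples: faster than iterrows, yields lightweight namedtuples
sums_itertuples = [t.a + t.b + t.c for t in df.itertuples(index=False)]

# apply: row-wise function application
sums_apply = df.apply(lambda row: row["a"] + row["b"] + row["c"], axis=1)

# vectorized: fastest, no Python-level loop at all
sums_vectorized = df.sum(axis=1)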


1 Answer

If you're just trying to make it faster and aren't specifically set on Cython, I'd do it in plain NumPy (about 50x faster).

def numpy_foo(arr):
    vals = {i: (arr[i, :] + arr[i:, :]).sum(axis=1).tolist()
            for i in range(arr.shape[0])}
    return vals

%timeit foo(df)
100 loops, best of 3: 7.2 ms per loop

%timeit numpy_foo(df.values)
10000 loops, best of 3: 144 µs per loop

foo(df) == numpy_foo(df.values)
Out[586]: True

Generally speaking, pandas gives you a lot of conveniences relative to NumPy, but there are overhead costs. So in situations where pandas isn't really adding anything, you can generally speed things up by dropping down to NumPy. For another example, see this question I asked, which showed a roughly comparable speed difference (about 23x).
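If you do want to try the Cython route the question asks about, one possible typed version is sketched below. This is a sketch, not a benchmarked solution: it assumes %load_ext Cython has been run in IPython and that the data is passed in as the float64 array df.values. The cdef'd typed memoryview and index variables let the triple loop compile to C without creating intermediate arrays.

%%cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cython_foo(double[:, :] arr):
    # n rows, m columns of the float64 input
    cdef Py_ssize_t n = arr.shape[0]
    cdef Py_ssize_t m = arr.shape[1]
    cdef Py_ssize_t i, j, k
    cdef double row_sum
    vals = {}
    for i in range(n):
        sums = []
        # for each remaining row j, sum (row i + row j) across columns
        for j in range(i, n):
            row_sum = 0.0
            for k in range(m):
                row_sum += arr[i, k] + arr[j, k]
            sums.append(row_sum)
        vals[i] = sums
    return vals

Called as cython_foo(df.values), it should return the same dictionary as foo(df).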

answered Oct 02 '22 by JohnE