pandas much slower than numpy?

The timings below suggest that pandas can be much slower than numpy, at least in the specific case of clip(). Surprisingly, a round trip from pandas to numpy and back to pandas, with the calculation done in numpy, is still much faster than doing it in pandas directly.

Shouldn't the pandas function have been implemented in this roundabout way?

In [49]: arr = np.random.randn(1000, 1000)

In [50]: df=pd.DataFrame(arr)

In [51]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 8.18 ms per loop

In [52]: %timeit df.clip_lower(0)
1 loops, best of 3: 344 ms per loop

In [53]: %timeit pd.DataFrame(np.clip(df.values, 0, None))
100 loops, best of 3: 8.4 ms per loop

Soldalma asked Nov 07 '13 10:11

1 Answer

In master/0.13 (to be released very shortly), this is much faster, though still slightly slower than native numpy because of the handling of alignment, dtypes, and NaNs.

In 0.12, the clip was applied column by column, which made it a relatively expensive operation.
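
To illustrate why a column-by-column strategy costs more, here is a rough sketch contrasting it with a single vectorized call on the underlying array. The helper names `clip_per_column` and `clip_whole_block` are hypothetical, and this is not the actual pandas 0.12 code, only an approximation of the two strategies.

```python
import numpy as np
import pandas as pd


def clip_per_column(df, lower):
    # Hypothetical helper: clip one column at a time, then reassemble.
    # Every iteration pays Python-level and per-column lookup overhead,
    # which adds up over 1000 columns.
    clipped = {col: np.clip(df[col].values, lower, None) for col in df.columns}
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)


def clip_whole_block(df, lower):
    # Hypothetical helper: one vectorized call on the full 2-D array,
    # re-attaching the original row and column labels afterwards.
    return pd.DataFrame(np.clip(df.values, lower, None),
                        index=df.index, columns=df.columns)


arr = np.random.randn(200, 200)
df = pd.DataFrame(arr)

per_column = clip_per_column(df, 0)
whole_block = clip_whole_block(df, 0)
```

Both helpers should produce the same values; they differ only in how many separate calls are made to get there.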

In [4]: arr = np.random.randn(1000, 1000)

In [5]: df=pd.DataFrame(arr)

In [6]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 6.62 ms per loop

In [7]: %timeit df.clip_lower(0)
100 loops, best of 3: 12.9 ms per loop
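
For readers on a recent pandas: `clip_lower` was deprecated in pandas 0.24 and removed in 1.0, and `DataFrame.clip(lower=...)` takes its place. The sketch below repeats the comparison with that call, using the standard `timeit` module instead of the IPython `%timeit` magic. It also passes `index` and `columns` on the round trip, since `pd.DataFrame(np.clip(df.values, 0, None))` on its own would discard any custom labels. Timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.randn(1000, 1000)
df = pd.DataFrame(arr)

# Current spelling of the old df.clip_lower(0).
clipped_pd = df.clip(lower=0)

# Round trip through numpy, keeping the original row and column labels.
clipped_np = pd.DataFrame(np.clip(df.to_numpy(), 0, None),
                          index=df.index, columns=df.columns)

# Time a few repetitions of each approach (seconds per call).
n = 5
t_numpy = timeit.timeit(lambda: np.clip(arr, 0, None), number=n) / n
t_pandas = timeit.timeit(lambda: df.clip(lower=0), number=n) / n
print(f"np.clip: {t_numpy * 1e3:.2f} ms, df.clip: {t_pandas * 1e3:.2f} ms")
```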

Jeff answered Oct 10 '22 10:10