The code below suggests that pandas may be much slower than numpy, at least in the specific case of the function clip(). What is surprising is that making a round trip from pandas to numpy and back to pandas, while performing the calculation in numpy, is still much faster than doing it in pandas.
Shouldn't the pandas function have been implemented in this roundabout way?
In [49]: arr = np.random.randn(1000, 1000)
In [50]: df=pd.DataFrame(arr)
In [51]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 8.18 ms per loop
In [52]: %timeit df.clip_lower(0)
1 loops, best of 3: 344 ms per loop
In [53]: %timeit pd.DataFrame(np.clip(df.values, 0, None))
100 loops, best of 3: 8.4 ms per loop
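For anyone who wants to rerun this comparison outside IPython, here is a minimal sketch using `timeit`. It assumes a recent pandas, where `clip_lower` is no longer available and `df.clip(lower=0)` is the equivalent call. The helper name `clip_roundtrip` is made up for illustration, and the timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def clip_roundtrip(df, lower):
    """Clip in NumPy, then rebuild the frame with the original labels."""
    clipped = np.clip(df.to_numpy(), lower, None)
    return pd.DataFrame(clipped, index=df.index, columns=df.columns)


rng = np.random.default_rng(0)
arr = rng.standard_normal((1000, 1000))
df = pd.DataFrame(arr)

# Both routes should give identical values before we compare their speed.
assert clip_roundtrip(df, 0).equals(df.clip(lower=0))

candidates = {
    "np.clip(arr, 0, None)": lambda: np.clip(arr, 0, None),
    "df.clip(lower=0)": lambda: df.clip(lower=0),
    "clip_roundtrip(df, 0)": lambda: clip_roundtrip(df, 0),
}
for label, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{label:<24} {best * 1e3:7.2f} ms per call")
```

Unlike the one-liner in the question, the helper passes the index and columns back in, so the round trip does not lose the frame's labels.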
NumPy's main data object is an array, specifically the ndarray. Operations on ndarrays are significantly faster than on Python lists, since the looping happens in compiled code rather than in the Python interpreter. In Pandas, the primary data objects are the DataFrame, a two-dimensional table, and the Series, which is equivalent to a one-dimensional array.
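A small illustration of these objects, as a sketch with made-up values: the same squaring operation written as an explicit Python loop over a list and as a single vectorised expression on an ndarray, followed by the dimensionality of each container.

```python
import numpy as np
import pandas as pd

values = list(range(5))

# Pure Python: the interpreter visits every element itself.
squared_loop = [v * v for v in values]

# NumPy: one expression; the element-wise loop runs in compiled code.
arr = np.array(values)
squared_vec = arr * arr

# A Series wraps a one-dimensional array; a DataFrame is two-dimensional.
ser = pd.Series(arr)
df = pd.DataFrame({"x": arr, "x_squared": squared_vec})

print(arr.ndim, ser.ndim, df.ndim)  # prints: 1 1 2
```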
NumPy is memory-efficient. Pandas is said to perform better when the number of rows is 500K or more, and NumPy when the number of rows is 50K or less. Indexing a pandas Series is very slow compared with indexing a NumPy array.
As a rule of thumb, then, NumPy can be said to outperform Pandas on datasets of up to fifty thousand (50K) rows. (Between fifty thousand and five hundred thousand rows, performance mostly depends on the type of operation Pandas and NumPy have to perform.)
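Row-count thresholds like these depend heavily on the operation, the hardware and the library versions, so it is worth measuring on your own data. Below is a rough sketch of such a measurement. The helper `time_sum` is hypothetical, and a plain column sum is only one of many operations you might care about.

```python
import timeit

import numpy as np
import pandas as pd


def time_sum(n_rows, number=5):
    """Return best per-call seconds (NumPy, pandas) for summing n_rows floats."""
    arr = np.random.default_rng(0).standard_normal(n_rows)
    ser = pd.Series(arr)
    t_np = min(timeit.repeat(arr.sum, number=number, repeat=3)) / number
    t_pd = min(timeit.repeat(ser.sum, number=number, repeat=3)) / number
    return t_np, t_pd


for n in (1_000, 50_000, 500_000):
    t_np, t_pd = time_sum(n)
    print(f"{n:>7} rows  numpy {t_np * 1e6:9.1f} us  pandas {t_pd * 1e6:9.1f} us")
```

Swapping `sum` for the operation you actually run (a groupby, a join, a boolean filter) is the more informative test, since the text above notes that the mid-range result depends on the operation.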
Pandas is slow at retrieving individual values, and Python, generally, is slow at mathematical computations. Constantly retrieving data from Pandas and computing on it in Python means using our tools exactly where they are weak.
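To make that concrete, here is a sketch with a tiny made-up table (the column names `price` and `qty` are illustrative). The first version pulls each row out of pandas and multiplies in Python; the other two keep the arithmetic inside pandas or NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Weak spot: retrieve one row at a time, then do the maths in Python.
total_loop = 0.0
for _, row in df.iterrows():
    total_loop += row["price"] * row["qty"]

# Vectorised in pandas: one multiplication over whole columns.
total_vec = (df["price"] * df["qty"]).sum()

# Or drop to NumPy arrays for the computation.
total_np = float(np.dot(df["price"].to_numpy(), df["qty"].to_numpy()))

print(total_loop, total_vec, total_np)  # prints: 140.0 140.0 140.0
```

All three give the same answer; the difference only shows up in run time as the number of rows grows.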
In master/0.13 (to be released very shortly), this is much faster (still slightly slower than native numpy because of the handling of alignment/dtype/NaNs).
In 0.12 it was applied per column, so this was a relatively expensive operation.
In [4]: arr = np.random.randn(1000, 1000)
In [5]: df=pd.DataFrame(arr)
In [6]: %timeit np.clip(arr, 0, None)
100 loops, best of 3: 6.62 ms per loop
In [7]: %timeit df.clip_lower(0)
100 loops, best of 3: 12.9 ms per loop