Why are numpy functions so slow on pandas series / dataframes?

Tags: performance, python, pandas, numpy

Consider a small MWE, taken from another question:

    DateTime                Data
    2017-11-21 18:54:31     1
    2017-11-22 02:26:48     2
    2017-11-22 10:19:44     3
    2017-11-22 15:11:28     6
    2017-11-22 23:21:58     7
    2017-11-28 14:28:28    28
    2017-11-28 14:36:40     0
    2017-11-28 14:59:48     1

The goal is to clip all values with an upper bound of 1. My answer uses np.clip, which works fine.

    np.clip(df.Data, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])

Or,

    np.clip(df.Data.values, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])

Both return the same answer. My question is about the relative performance of these two methods. Consider:

    df = pd.concat([df]*1000).reset_index(drop=True)

    %timeit np.clip(df.Data, a_min=None, a_max=1)
    1000 loops, best of 3: 270 µs per loop

    %timeit np.clip(df.Data.values, a_min=None, a_max=1)
    10000 loops, best of 3: 23.4 µs per loop

Why is there such a massive difference between the two, just by calling values on the latter? In other words...

Why are numpy functions so slow on pandas objects?

asked Dec 19 '17 19:12

cs95

2 Answers

Yes, it seems like np.clip is a lot slower on pandas.Series than on numpy.ndarrays. That's correct, but it's actually (at least asymptotically) not that bad. 8000 elements is still in the regime where constant factors are major contributors to the runtime. I think this is a very important aspect of the question, so I'm visualizing it (borrowing from another answer):

    # Setup

    import pandas as pd
    import numpy as np

    def on_series(s):
        return np.clip(s, a_min=None, a_max=1)

    def on_values_of_series(s):
        return np.clip(s.values, a_min=None, a_max=1)

    # Timing setup
    timings = {on_series: [], on_values_of_series: []}
    sizes = [2**i for i in range(1, 26, 2)]

    # Timing
    for size in sizes:
        func_input = pd.Series(np.random.randint(0, 30, size=size))
        for func in timings:
            res = %timeit -o func(func_input)
            timings[func].append(res)

    %matplotlib notebook

    import matplotlib.pyplot as plt
    import numpy as np

    fig, (ax1, ax2) = plt.subplots(1, 2)

    for func in timings:
        ax1.plot(sizes,
                 [time.best for time in timings[func]],
                 label=str(func.__name__))
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.set_xlabel('size')
    ax1.set_ylabel('time [seconds]')
    ax1.grid(which='both')
    ax1.legend()

    baseline = on_values_of_series # choose one function as baseline
    for func in timings:
        ax2.plot(sizes,
                 [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
                 label=str(func.__name__))
    ax2.set_yscale('log')
    ax2.set_xscale('log')
    ax2.set_xlabel('size')
    ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
    ax2.grid(which='both')
    ax2.legend()

    plt.tight_layout()

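[Figure: two log-log plots over input size for on_series and on_values_of_series. Left: time [seconds]. Right: time relative to on_values_of_series.]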

It's a log-log plot because I think this shows the important features more clearly. For example, it shows that np.clip on a numpy.ndarray is faster, but also that it has a much smaller constant factor in that case. For large arrays the difference is only a factor of ~3! That's still a big difference, but far less than the difference on small arrays.
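The measurement relies on IPython magics (%timeit, %matplotlib). Outside a notebook, roughly the same comparison can be sketched with the standard library's timeit module. This is only a minimal sketch: the helper names best_time and compare are made up for illustration, and the repeat and number settings are arbitrary choices, so the absolute numbers will differ from the ones quoted in this answer.

```python
import timeit

import numpy as np
import pandas as pd


def on_series(s):
    return np.clip(s, a_min=None, a_max=1)


def on_values_of_series(s):
    return np.clip(s.values, a_min=None, a_max=1)


def best_time(func, arg, repeat=3, number=20):
    """Best observed time per call, in seconds (hypothetical helper)."""
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number


def compare(sizes):
    """Return (size, series time, values time, ratio) for each size."""
    rows = []
    for size in sizes:
        s = pd.Series(np.random.randint(0, 30, size=size))
        t_series = best_time(on_series, s)
        t_values = best_time(on_values_of_series, s)
        rows.append((size, t_series, t_values, t_series / t_values))
    return rows


for size, t_series, t_values, ratio in compare([2**4, 2**10, 2**16]):
    print("size={:>6}  series={:.2e}s  values={:.2e}s  ratio={:.1f}".format(
        size, t_series, t_values, ratio))
```

The ratio column is the interesting one: it should shrink as the size grows, which is the constant-factor effect described here.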

However, that still doesn't answer the question of where the time difference comes from.

The solution is actually quite simple: np.clip delegates to the clip method of the first argument:

    >>> np.clip??
    Source:
    def clip(a, a_min, a_max, out=None):
        """
        ...
        """
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    >>> np.core.fromnumeric._wrapfunc??
    Source:
    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # ...
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)

The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods, yes, completely different methods:

    >>> np.ndarray.clip
    <method 'clip' of 'numpy.ndarray' objects>
    >>> pd.Series.clip
    <function pandas.core.generic.NDFrame.clip>
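The delegation can be observed directly with a small stand-in object. The Recorder class below is purely hypothetical (it is not part of NumPy or pandas); it only defines a clip method and records how it was called. The sketch assumes np.clip forwards to that method as in the source quoted above.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in: any object that exposes a clip method."""

    def __init__(self):
        self.calls = []

    def clip(self, *args, **kwargs):
        # Record the arguments np.clip forwards to us.
        self.calls.append((args, kwargs))
        return "handled by Recorder.clip"


rec = Recorder()
result = np.clip(rec, None, 1)

print(result)        # the return value of Recorder.clip, not an array
print(rec.calls[0])  # the arguments forwarded by np.clip
```

NumPy never looks at the data here; whatever the object's own clip method does (and costs) is what np.clip does (and costs).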

Unfortunately, np.ndarray.clip is a C function, so it's hard to profile. However, pd.Series.clip is a regular Python function, so it's easy to profile. Let's use a Series of 5000 integers here:

    s = pd.Series(np.random.randint(0, 100, 5000))

For the np.clip on the values I get the following line-profiling:

    %load_ext line_profiler
    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 2.25641e-05 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 1.51795e-05 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      5.4      try:
        57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

But for np.clip on the Series I get a totally different profiling result:

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.000823794 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00081846 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      0.1      try:
        57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.000804922 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
      ...
      5021                                                   """
      5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
      5033         1            3      3.0      0.2              lower = None
      5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            1      1.0      0.1          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                                   # fast-path for scalars
      5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                                   return result

    Total time: 0.000662153 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
      4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1           22     22.0      1.4          result = self.values
      4926         1          571    571.0     35.4          mask = isna(result)
      4927
      4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
      4929         1            1      1.0      0.1              if upper is not None:
      4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
      4931         1           33     33.0      2.0              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1           73     73.0      4.5          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935
      4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
      4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.1          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.1              return result

I stopped going into the subroutines at that point because it already highlights where pd.Series.clip does much more work than np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) check (158 timer units). At that point the pandas method hasn't even started clipping, and it has already taken roughly 3 times as long.

However, several of these "overheads" become insignificant when the array is big:

    s = pd.Series(np.random.randint(0, 100, 1000000))

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.00593476 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00592779 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            1      1.0      0.0      try:
        57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.00591302 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
      ...
      5021                                                   """
      5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
      5033         1            2      2.0      0.0              lower = None
      5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            2      2.0      0.0          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                                   # fast-path for scalars
      5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                                   return result

    Total time: 0.00575753 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
      4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1          120    120.0      0.9          result = self.values
      4926         1         3525   3525.0     25.1          mask = isna(result)
      4927
      4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
      4929         1            2      2.0      0.0              if upper is not None:
      4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
      4931         1           61     61.0      0.4              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1          283    283.0      2.0          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935
      4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
      4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.0          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.0              return result

There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (that's in the regime where the timing difference is a factor of ~3 on my computer).
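The steps that dominate the profile of _clip_with_scalar can be approximated in plain NumPy to see why they cost more than a single clip call. This is a simplified sketch and not the pandas implementation: the function name is hypothetical, np.isnan stands in for pandas' isna, only the upper bound is handled, and the final reconstruction of the Series is left out.

```python
import numpy as np


def clip_upper_in_several_passes(values, upper):
    """Hypothetical helper mirroring the profiled steps for a float array."""
    mask = np.isnan(values)                  # pass 1: build a NaN mask
    with np.errstate(all="ignore"):
        # pass 2: compare, then pass 3: select element by element
        result = np.where(values >= upper, upper, values)
    if np.any(mask):                         # pass 4: scan the mask
        result[mask] = np.nan                # restore missing values
    return result


values = np.array([0.0, 0.5, 3.0, np.nan, 28.0])
several_passes = clip_upper_in_several_passes(values, 1)
single_call = np.clip(values, None, 1)

print(several_passes)
print(single_call)
```

Both approaches give the same values, but the first one walks over the data several times and allocates intermediate arrays, which is consistent with isna and np.where dominating the profile for large inputs.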

The takeaway should probably be:

- Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
- Profiling, especially line-profiling, can be a great tool to find the places where the performance difference comes from.
- Always make sure to test differently sized objects in such cases. You could be comparing constant factors that probably don't matter unless you process lots of small arrays.
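If the constant overhead does matter, for example when clipping many small Series, one option is to clip the underlying array and re-wrap the result. The helper below is a hypothetical sketch, not a pandas API. It assumes a numeric Series with a scalar upper bound, and it skips the extra checks that pd.Series.clip performs (such as the NaN threshold checks), so it is not a drop-in replacement.

```python
import numpy as np
import pandas as pd


def clip_upper_on_values(s, upper):
    """Hypothetical helper: clip the raw array, then restore index and name."""
    clipped = np.clip(s.to_numpy(), None, upper)
    return pd.Series(clipped, index=s.index, name=s.name)


s = pd.Series([1, 2, 3, 6, 7, 28, 0, 1], name="Data")
result = clip_upper_on_values(s, 1)
print(result.tolist())
```

Re-wrapping costs one Series construction, so whether this is a net win depends on the input size; it is worth timing on your own data.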

Used versions:

    Python 3.6.3 64-bit on Windows 10
    Numpy 1.13.3
    Pandas 0.21.1

answered Oct 08 '22 09:10

MSeifert

Just read the source code, it's clear.

    def clip(a, a_min, a_max, out=None):
        """a : array_like Array containing elements to clip."""
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # This situation has occurred in the case of
        # a downstream library like 'pandas'.
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)

    def _wrapit(obj, method, *args, **kwds):
        try:
            wrap = obj.__array_wrap__
        except AttributeError:
            wrap = None
        result = getattr(asarray(obj), method)(*args, **kwds)
        if wrap:
            if not isinstance(result, mu.ndarray):
                result = asarray(result)
            result = wrap(result)
        return result

Correction:

Since pandas v0.13.0_ahl1, pandas has its own implementation of clip.
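The _wrapit fallback in the quoted source is only reached when the object has no usable clip method. A plain Python list is a simple case: it has no clip attribute, so NumPy converts it to an array first and clips that. A minimal sketch (the variable names are only for illustration):

```python
import numpy as np

data = [1, 2, 3, 6, 7, 28, 0, 1]

# A list has no clip method, so np.clip cannot delegate to it.
print(hasattr(data, "clip"))

# NumPy falls back to converting the list to an ndarray and clipping that.
result = np.clip(data, None, 1)
print(type(result).__name__, result.tolist())
```

A Series never takes this path, because pd.Series.clip exists and accepts the call, which is why the pandas implementation ends up being the one that runs.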

answered Oct 08 '22 11:10

obgnaw
