Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting a series of ints to strings - Why is apply much faster than astype?

I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:

import numpy as np import pandas as pd  x = pd.Series(np.random.randint(0, 100, 1000000)) 

On StackOverflow and other websites, I've seen most people argue that the best way to do this is:

%% timeit x = x.astype(str) 

This takes about 2 seconds.

When I use x = x.apply(str), it only takes 0.2 seconds.

Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?

I'm mainly interested in python 3's behavior for this.

like image 241
none Avatar asked Mar 19 '18 20:03

none


People also ask

What is the use of Astype?

The astype() method returns a new DataFrame where the data types has been changed to the specified type.

Which is the method function that you can apply for Converting data to float from mixed type?

Converting string to int/float Similarly, if we want to convert the data type to float, we can call astype('float') .

What does Astype int do in Python?

. astype() is a method within numpy. ndarray , as well as the Pandas Series class, so can be used to convert vectors, matrices and columns within a DataFrame . However, int() is a pure-Python function that can only be applied to scalar values.

How to convert data to numeric in Python?

In Python an strings can be converted into a integer using the built-in int() function. The int() function takes in any python data type and converts it into a integer.


1 Answers

Let's begin with a bit of general advise: If you're interested in finding the bottlenecks of Python code you can use a profiler to find the functions/parts that eat up most of the time. In this case I use a line-profiler because you can actually see the implementation and the time spent on each line.

However, these tools don't work with C or Cython by default. Given that CPython (that's the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython there will be a limit how far I'll get with profiling.

Actually: one probably could extend profiling to the Cython code and probably also the C code by recompiling it with debug symbols and tracing, however it's not an easy task to compile these libraries so I won't do that (but if someone likes to do that the Cython documentation includes a page about profiling Cython code).

But let's see how far I can get:

Line-Profiling Python code

I'm going to use line-profiler and a Jupyter Notebook here:

%load_ext line_profiler  import numpy as np import pandas as pd  x = pd.Series(np.random.randint(0, 100, 100000)) 

Profiling x.astype

%lprun -f x.astype x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================     87                                                   @wraps(func)     88                                                   def wrapper(*args, **kwargs):     89         1           12     12.0      0.0              old_arg_value = kwargs.pop(old_arg_name, None)     90         1            5      5.0      0.0              if old_arg_value is not None:     91                                                           if mapping is not None:    ...    118         1       663354 663354.0    100.0              return func(*args, **kwargs) 

So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:

%lprun -f x.astype.__wrapped__ x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================   3896                                               @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',   3897                                                                mapping={True: 'raise', False: 'ignore'})   3898                                               def astype(self, dtype, copy=True, errors='raise', **kwargs):   3899                                                   """   ...   3975                                                   """   3976         1           28     28.0      0.0          if is_dict_like(dtype):   3977                                                       if self.ndim == 1:  # i.e. Series   ...   4001                                              4002                                                   # else, only a single dtype is given   4003         1           14     14.0      0.0          new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,   4004         1       685863 685863.0     99.9                                       **kwargs)   4005         1          340    340.0      0.0          return self._constructor(new_data).__finalize__(self) 

Source

Again one line is the bottleneck so let's check the _data.astype method:

%lprun -f x._data.astype x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================   3461                                               def astype(self, dtype, **kwargs):   3462         1       695866 695866.0    100.0          return self.apply('astype', dtype=dtype, **kwargs) 

Okay, another delegate, let's see what _data.apply does:

%lprun -f x._data.apply x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================   3251                                               def apply(self, f, axes=None, filter=None, do_integrity_check=False,   3252                                                         consolidate=True, **kwargs):   3253                                                   """   ...   3271                                                   """   3272                                              3273         1           12     12.0      0.0          result_blocks = []   ...   3309                                              3310         1           10     10.0      0.0          aligned_args = dict((k, kwargs[k])   3311         1           29     29.0      0.0                              for k in align_keys   3312                                                                       if hasattr(kwargs[k], 'reindex_axis'))   3313                                              3314         2           28     14.0      0.0          for b in self.blocks:   ...   3329         1       674974 674974.0    100.0              applied = getattr(b, f)(**kwargs)   3330         1           30     30.0      0.0              result_blocks = _extend_blocks(applied, result_blocks)   3331                                              3332         1           10     10.0      0.0          if len(result_blocks) == 0:   3333                                                       return self.make_empty(axes or self.axes)   3334         1           10     10.0      0.0          bm = self.__class__(result_blocks, axes or self.axes,   3335         1           76     76.0      0.0                              do_integrity_check=do_integrity_check)   3336         1           13     13.0      0.0          bm._consolidate_inplace()   3337         1            7      7.0      0.0          return bm 

Source

And again ... one function call is taking all the time, this time it's x._data.blocks[0].astype:

%lprun -f x._data.blocks[0].astype x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================    542                                               def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):    543         1           18     18.0      0.0          return self._astype(dtype, copy=copy, errors=errors, values=values,    544         1       671092 671092.0    100.0                              **kwargs) 

.. which is another delegate...

%lprun -f x._data.blocks[0]._astype x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================    546                                               def _astype(self, dtype, copy=False, errors='raise', values=None,    547                                                           klass=None, mgr=None, **kwargs):    548                                                   """    ...    557                                                   """    558         1           11     11.0      0.0          errors_legal_values = ('raise', 'ignore')    559                                               560         1            8      8.0      0.0          if errors not in errors_legal_values:    561                                                       invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "    562                                                                      "Supplied value is '{}'".format(    563                                                                          list(errors_legal_values), errors))    564                                                       raise ValueError(invalid_arg)    565                                               566         1           23     23.0      0.0          if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):    567                                                       msg = ("Expected an instance of {}, but got the class instead. "    568                                                              "Try instantiating 'dtype'.".format(dtype.__name__))    569                                                       raise TypeError(msg)    570                                               571                                                   # may need to convert to categorical    572                                                   # this is only called for non-categoricals    573         1           72     72.0      0.0          if self.is_categorical_astype(dtype):    ...    595                                               596                                                   # astype processing    597         1           16     16.0      0.0          dtype = np.dtype(dtype)    598         1           19     19.0      0.0          if self.dtype == dtype:    ...    603         1            8      8.0      0.0          if klass is None:    604         1           13     13.0      0.0              if dtype == np.object_:    605                                                           klass = ObjectBlock    606         1            6      6.0      0.0          try:    607                                                       # force the copy here    608         1            7      7.0      0.0              if values is None:    609                                               610         1            8      8.0      0.0                  if issubclass(dtype.type,    611         1           14     14.0      0.0                                (compat.text_type, compat.string_types)):    612                                               613                                                               # use native type formatting for datetime/tz/timedelta    614         1           15     15.0      0.0                      if self.is_datelike:    615                                                                   values = self.to_native_types()    616                                               617                                                               # astype formatting    618                                                               else:    619         1            8      8.0      0.0                          values = self.values    620                                               621                                                           else:    622                                                               values = self.get_values(dtype=dtype)    623                                               624                                                           # _astype_nansafe works fine with 1-d only    625         1       665777 665777.0     99.9                  values = astype_nansafe(values.ravel(), dtype, copy=True)    626         1           32     32.0      0.0                  values = values.reshape(self.shape)    627                                               628         1           17     17.0      0.0              newb = make_block(values, placement=self.mgr_locs, dtype=dtype,    629         1          269    269.0      0.0                                klass=klass)    630                                                   except:    631                                                       if errors == 'raise':    632                                                           raise    633                                                       newb = self.copy() if copy else self    634                                               635         1            8      8.0      0.0          if newb.is_numeric and self.is_numeric:    ...    642         1            6      6.0      0.0          return newb 

Source

... okay, still not there. Let's check out astype_nansafe:

%lprun -f pd.core.internals.astype_nansafe x.astype(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================    640                                           def astype_nansafe(arr, dtype, copy=True):    641                                               """ return a view if copy is False, but    642                                                   need to be very careful as the result shape could change! """    643         1           13     13.0      0.0      if not isinstance(dtype, np.dtype):    644                                                   dtype = pandas_dtype(dtype)    645                                               646         1            8      8.0      0.0      if issubclass(dtype.type, text_type):    647                                                   # in Py3 that's str, in Py2 that's unicode    648         1       663317 663317.0    100.0          return lib.astype_unicode(arr.ravel()).reshape(arr.shape)    ... 

Source

Again one it's one line that takes 100%, so I'll go one function further:

%lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)  UserWarning: Could not extract a code object for the object <built-in function astype_unicode> 

Okay, we found a built-in function, that means it's a C function. In this case it's a Cython function. But it means we cannot dig deeper with line-profiler. So I'll stop here for now.

Profiling x.apply

%lprun -f x.apply x.apply(str) 
Line #      Hits         Time  Per Hit   % Time  Line Contents ==============================================================   2426                                               def apply(self, func, convert_dtype=True, args=(), **kwds):   2427                                                   """   ...   2523                                                   """   2524         1           84     84.0      0.0          if len(self) == 0:   2525                                                       return self._constructor(dtype=self.dtype,   2526                                                                                index=self.index).__finalize__(self)   2527                                              2528                                                   # dispatch to agg   2529         1           11     11.0      0.0          if isinstance(func, (list, dict)):   2530                                                       return self.aggregate(func, *args, **kwds)   2531                                              2532                                                   # if we are a string, try to dispatch   2533         1           12     12.0      0.0          if isinstance(func, compat.string_types):   2534                                                       return self._try_aggregate_string_function(func, *args, **kwds)   2535                                              2536                                                   # handle ufuncs and lambdas   2537         1            7      7.0      0.0          if kwds or args and not isinstance(func, np.ufunc):   2538                                                       f = lambda x: func(x, *args, **kwds)   2539                                                   else:   2540         1            6      6.0      0.0              f = func   2541                                              2542         1          154    154.0      0.1          with np.errstate(all='ignore'):   2543         1           11     11.0      0.0              if isinstance(f, np.ufunc):   2544                                                           return f(self)   2545                                              2546                                                       # row-wise access   2547         1          188    188.0      0.1              if is_extension_type(self.dtype):   2548                                                           mapped = self._values.map(f)   2549                                                       else:   2550         1         6238   6238.0      3.3                  values = self.asobject   2551         1       181910 181910.0     95.5                  mapped = lib.map_infer(values, f, convert=convert_dtype)   2552                                              2553         1           28     28.0      0.0          if len(mapped) and isinstance(mapped[0], Series):   2554                                                       from pandas.core.frame import DataFrame   2555                                                       return DataFrame(mapped.tolist(), index=self.index)   2556                                                   else:   2557         1           19     19.0      0.0              return self._constructor(mapped,   2558         1         1870   1870.0      1.0                                       index=self.index).__finalize__(self) 

Source

Again it's one function that takes most of the time: lib.map_infer ...

%lprun -f pd.core.series.lib.map_infer x.apply(str) 
Could not extract a code object for the object <built-in function map_infer> 

Okay, that's another Cython function.

This time there's another (although less significant) contributor with ~3%: values = self.asobject. But I'll ignore this for now, because we're interested in the major contributors.

Going into C/Cython

The functions called by astype

This is the astype_unicode function:

cpdef ndarray[object] astype_unicode(ndarray arr):     cdef:         Py_ssize_t i, n = arr.size         ndarray[object] result = np.empty(n, dtype=object)      for i in range(n):         # we can use the unsafe version because we know `result` is mutable         # since it was created from `np.empty`         util.set_value_at_unsafe(result, i, unicode(arr[i]))      return result 

Source

This function uses this helper:

cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):     cdef:         Py_ssize_t i, sz     if is_float_object(loc):         casted = int(loc)         if casted == loc:             loc = casted     i = <Py_ssize_t> loc     sz = cnp.PyArray_SIZE(arr)      if i < 0:         i += sz     elif i >= sz:         raise IndexError('index out of bounds')      assign_value_1d(arr, i, value) 

Source

Which itself uses this C function:

PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,                                   PyObject* v) {     npy_intp i = (npy_intp)_i;     char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);     return PyArray_DESCR(ap)->f->setitem(v, item, ap); } 

Source

Functions called by apply

This is the implementation of the map_infer function:

def map_infer(ndarray arr, object f, bint convert=1):     cdef:         Py_ssize_t i, n         ndarray[object] result         object val      n = len(arr)     result = np.empty(n, dtype=object)     for i in range(n):         val = f(util.get_value_at(arr, i))          # unbox 0-dim arrays, GH #690         if is_array(val) and PyArray_NDIM(val) == 0:             # is there a faster way to unbox?             val = val.item()          result[i] = val      if convert:         return maybe_convert_objects(result,                                      try_float=0,                                      convert_datetime=0,                                      convert_timedelta=0)      return result 

Source

With this helper:

cdef inline object get_value_at(ndarray arr, object loc):     cdef:         Py_ssize_t i, sz         int casted      if is_float_object(loc):         casted = int(loc)         if casted == loc:             loc = casted     i = <Py_ssize_t> loc     sz = cnp.PyArray_SIZE(arr)      if i < 0 and sz > 0:         i += sz     elif i >= sz or sz == 0:         raise IndexError('index out of bounds')      return get_value_1d(arr, i) 

Source

Which uses this C function:

PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {     char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);     return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap); } 

Source

Some thoughts on the Cython code

There are some differences between the Cython codes that are called eventually.

The one taken by astype uses unicode while the apply path uses the function passed in. Let's see if that makes a difference (again IPython/Jupyter makes it very easy to compile Cython code yourself):

%load_ext cython  %%cython  import numpy as np cimport numpy as np  cpdef object func_called_by_astype(np.ndarray arr):     cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)     for i in range(arr.size):         ret[i] = unicode(arr[i])     return ret  cpdef object func_called_by_apply(np.ndarray arr, object f):     cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)     for i in range(arr.size):         ret[i] = f(arr[i])     return ret 

Timing:

import numpy as np  arr = np.random.randint(0, 10000, 1000000) %timeit func_called_by_astype(arr) 514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit func_called_by_apply(arr, str) 632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Okay, there is a difference but it's wrong, it would actually indicate that apply would be slightly slower.

But remember the asobject call that I mentioned earlier in the apply function? Could that be the reason? Let's see:

import numpy as np  arr = np.random.randint(0, 10000, 1000000) %timeit func_called_by_astype(arr) 557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit func_called_by_apply(arr.astype(object), str) 317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str is a Python function and these are generally much faster if you already have Python objects and NumPy (or Pandas) don't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, except when the array is of dtype object).

However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:

val = f(util.get_value_at(arr, i)) if is_array(val) and PyArray_NDIM(val) == 0:     val = val.item() result[i] = val 

part of the map_infer function is faster than:

for i in range(n):     # we can use the unsafe version because we know `result` is mutable     # since it was created from `np.empty`     util.set_value_at_unsafe(result, i, unicode(arr[i])) 

which is called by the astype(str) path. The comments of the first function seem to indicate that the writer of map_infer actually tried to make the code as fast as possible (see the comment about "is there a faster way to unbox?" while the other one maybe was written without special care about performance. But that's just a guess.

Also on my computer I'm actually quite close to the performance of the x.astype(str) and x.apply(str) already:

import numpy as np  arr = np.random.randint(0, 100, 1000000) s = pd.Series(arr) %timeit s.astype(str) 535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit func_called_by_astype(arr) 547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   %timeit s.apply(str) 216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit func_called_by_apply(arr.astype(object), str) 272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 

Note that I also checked some other variants that return a different result:

%timeit s.values.astype(str)  # array of strings 407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit list(map(str, s.values.tolist()))  # list of strings 184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 

Interestingly the Python loop with list and map seems to be the fastest on my computer.

I actually made a small benchmark including plot:

import pandas as pd import simple_benchmark  def Series_astype(series):     return series.astype(str)  def Series_apply(series):     return series.apply(str)  def Series_tolist_map(series):     return list(map(str, series.values.tolist()))  def Series_values_astype(series):     return series.values.astype(str)   arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)} b = simple_benchmark.benchmark(     [Series_astype, Series_apply, Series_tolist_map, Series_values_astype],     arguments,     argument_name='Series size' )  %matplotlib notebook b.plot() 

enter image description here

Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.

The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:

Versions -------- Python 3.6.5 NumPy 1.14.2 Pandas 0.22.0 
like image 89
MSeifert Avatar answered Sep 18 '22 18:09

MSeifert