 

"Reduce" function for Series

Is there an analog for reduce for a pandas Series?

For example, the analog for map is pd.Series.apply, but I can't find any analog for reduce.
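For context, here is what the map analogue looks like on a toy Series (made-up data, just to show the shapes). Plain functools.reduce does accept a Series because it is iterable, but there is no dedicated pandas method:

import pandas as pd
from functools import reduce

# Toy Series, not the real data.
s = pd.Series([1, 2, 3, 4])

# map analogue: Series.apply
print(s.apply(lambda x: x * 2).tolist())   # [2, 4, 6, 8]

# functools.reduce still works, since a Series is iterable
print(reduce(lambda a, b: a + b, s))       # 10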


In my application, I have a pandas Series of lists:

>>> business["categories"].head()  0                      ['Doctors', 'Health & Medical'] 1                                        ['Nightlife'] 2                 ['Active Life', 'Mini Golf', 'Golf'] 3    ['Shopping', 'Home Services', 'Internet Servic... 4    ['Bars', 'American (New)', 'Nightlife', 'Loung... Name: categories, dtype: object 

I'd like to merge the Series of lists together using reduce, like so:

from functools import reduce

categories = reduce(lambda l1, l2: l1 + l2, business["categories"])

but this takes a horrifically long time, because each list concatenation copies both operands into a new list, so reducing the whole Series with + is quadratic in the total number of elements. I'm hoping that pd.Series has a vectorized way to perform this faster.
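To make the cost concrete (toy numbers, not the real dataset): every + builds a brand-new list containing everything accumulated so far, so the copying grows quadratically even though the final result is small:

from functools import reduce

# 1000 small lists; each `+` copies the whole accumulator into a new list,
# so roughly 5 million element copies happen for only 10,000 final elements.
chunks = [['x'] * 10 for _ in range(1000)]
merged = reduce(lambda l1, l2: l1 + l2, chunks)
print(len(merged))   # 10000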

asked Jan 26 '16 by hlin117



1 Answer

With itertools.chain() on the values

This could be faster:

from itertools import chain

categories = list(chain.from_iterable(categories.values))
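For example, on a small made-up Series (not the asker's data) the result is one flat Python list:

import pandas as pd
from itertools import chain

# Made-up Series of lists to show the shape of the result.
categories = pd.Series([['Doctors', 'Health & Medical'], ['Nightlife']])
print(list(chain.from_iterable(categories.values)))
# ['Doctors', 'Health & Medical', 'Nightlife']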

Performance

from functools import reduce
from itertools import chain

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this data set the chaining is about 68x faster.

Vectorization?

Vectorization works when you have native NumPy data types (pandas uses NumPy for its data after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up any performance you might get from vectorization. I made one attempt to vectorize the algorithm in another answer.
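As an aside that is not part of the original answer: pandas 0.25 and later also offer Series.explode(), which flattens a Series of lists into one row per element. It stays inside pandas but is not necessarily faster than chain.from_iterable, so treat this as a sketch rather than a benchmark:

import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']])

# Series.explode() (pandas >= 0.25) gives one row per list element;
# tolist() then yields the flattened Python list.
print(categories.explode().tolist())   # ['a', 'b', 'c', 'd', 'e']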

answered Sep 21 '22 by Mike Müller