
Pandas 'reduce' and 'accumulate' functions - incomplete implementation

I would like to use reduce and accumulate functions in Pandas in a way similar to how they apply in native Python with lists. In the functools and itertools implementations, reduce and accumulate (sometimes called fold and cumulative fold in other languages) take a function of two arguments: f(accumulated_value, popped_value). In Pandas, there is no similar implementation.

So, I have a list of binary variables and want to calculate the running duration of each period spent in the 1 state:

In [1]: from itertools import accumulate
        import pandas as pd
        drawdown_periods = [0,1,1,1,0,0,0,1,1,1,1,0,1,1,0]

applying accumulate to this with the lambda function

lambda x,y: (x+y)*y

gives

In [2]: list(accumulate(drawdown_periods, lambda x,y: (x+y)*y))
Out[2]: [0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 0, 1, 2, 0]

counting the length of each drawdown_period.
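To make the two-argument contract concrete, here is a minimal pure-Python sketch of what accumulate does with the function above. The name `running_fold` is hypothetical and only for illustration; it is not part of itertools or Pandas.

```python
from itertools import accumulate

def running_fold(values, func):
    """Hypothetical equivalent of list(itertools.accumulate(values, func))."""
    it = iter(values)
    try:
        acc = next(it)          # the first element seeds the accumulator
    except StopIteration:
        return []               # empty input gives empty output
    out = [acc]
    for value in it:
        acc = func(acc, value)  # f(accumulated_value, popped_value)
        out.append(acc)
    return out

drawdown_periods = [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
count_run = lambda x, y: (x + y) * y  # resets to 0 whenever y == 0
result = running_fold(drawdown_periods, count_run)
```

Multiplying by y is what resets the counter: when the next value is 0, the accumulated total is zeroed; when it is 1, the total grows by one.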

Is there a smart but quirky way to supply a lambda function with two arguments? I may be missing a trick here.

I know that there is a lovely recipe with groupby (see StackOverflow How to calculate consecutive Equal Values in Pandas/How to emulate itertools.groupby with a series/dataframe). I'll repeat it since it's so lovely:

In [3]: df = pd.DataFrame(data=drawdown_periods, columns=['dd'])
       df['dd'].groupby((df['dd'] != df['dd'].shift()).cumsum()).cumsum()
Out[3]:
    0     0
    1     1
    2     2
    3     3
    4     0
    5     0
    6     0
    7     1
    8     2
    9     3
    10    4
    11    0
    12    1
    13    2
    14    0
    Name: dd, dtype: int64   
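For readers unfamiliar with the recipe, it can be read as three steps. The sketch below splits the one-liner apart; the intermediate names `run_starts`, `run_id` and `run_length` are hypothetical and chosen only for illustration.

```python
import pandas as pd

dd = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0], name="dd")

# True wherever the value differs from the previous one, i.e. a new run starts.
# The first row compares against NaN, so it always counts as a run start.
run_starts = dd != dd.shift()

# A running count of run starts gives every run its own integer label.
run_id = run_starts.cumsum()

# The cumulative sum restarts inside each run: runs of 1 count up, runs of 0 stay at 0.
run_length = dd.groupby(run_id).cumsum()
```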

This is not the solution I want. I need a way of passing a two-parameter lambda function to a pandas-native reduce/accumulate function, since this would also work for many other functional programming recipes.

NBF asked May 30 '18 11:05




2 Answers

You could get this to work, with an efficiency penalty, using numpy. In practice, you may be better off writing ad hoc vectorized solutions.

Using np.frompyfunc:

import numpy as np
import pandas as pd

s = pd.Series([0,1,1,1,0,0,0,1,1,1,1,0,1,1,0])
f = np.frompyfunc(lambda x, y: (x+y) * y, 2, 1)
f.accumulate(s.astype(object))

0     0
1     1
2     2
3     3
4     0
5     0
6     0
7     1
8     2
9     3
10    4
11    0
12    1
13    2
14    0
dtype: object
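Note that the result above has dtype object. As a hedged sketch of one way to get an integer Series back, the variant below runs the ufunc on a plain object-dtype NumPy array and rebuilds the Series by hand. It also shows the matching reduce call, which returns only the final value. The names `values`, `run_length` and `final_value` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0])
f = np.frompyfunc(lambda x, y: (x + y) * y, 2, 1)

# Extract an object-dtype array so each element is passed to the lambda as a Python object.
values = s.to_numpy(dtype=object)

# Accumulate, then restore the original index and cast back to a numeric dtype.
run_length = pd.Series(f.accumulate(values), index=s.index).astype("int64")

# reduce folds the whole array down to the last accumulated value.
final_value = f.reduce(values)
```

The cast back to int64 only makes sense when the function is known to return integers; for other functions the object dtype may need to stay.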
hilberts_drinking_problem answered Sep 30 '22 05:09


What you are looking for would be a pandas method that extracts all objects from a Series, converts them to Python objects, calls a Python function on them, and keeps an accumulator that is also a Python object.

This kind of behavior does not scale well when you have a lot of data, as there is a lot of time/memory overhead in wrapping the raw data in Python objects. Pandas methods try to work directly on the underlying (numpy) raw data, being able to process lots of data without having to wrap them in Python objects. The groupby+cumsum example you give is a clever way of avoiding the use of .apply and Python functions, which would be slower.

Nevertheless, you are of course free to do your own functional thing in Python if you don't care about the performance. As it's all Python anyway and there's no way of speeding it up on the pandas side, you can just write your own:

df["cev"] = list(accumulate(df.dd, lambda x,y:(x+y)*y))
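If this pattern is needed repeatedly, the one-liner could be wrapped in a small helper that keeps the original index. This is only a sketch: `accumulate_series` is a hypothetical name, not a pandas function, and it carries the same pure-Python performance cost described above.

```python
from itertools import accumulate

import pandas as pd

def accumulate_series(series, func):
    """Hypothetical helper: run itertools.accumulate over a Series, keeping index and name."""
    return pd.Series(list(accumulate(series, func)), index=series.index, name=series.name)

df = pd.DataFrame({"dd": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]})
# Assignment aligns on the index, which the helper preserves.
df["cev"] = accumulate_series(df["dd"], lambda x, y: (x + y) * y)
```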
w-m answered Sep 30 '22 03:09