I would like to use reduce and accumulate functions in Pandas in a way similar to how they apply to native Python lists. The itertools and functools implementations of accumulate and reduce (sometimes called fold and cumulative fold in other languages) both require a function of two arguments, f(accumulated_value, popped_value). Pandas has no similar implementation.
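For concreteness, this is how the two-argument signature works on plain Python lists (a running sum, purely as illustration):
from functools import reduce
from itertools import accumulate

values = [1, 2, 3, 4]
reduce(lambda acc, x: acc + x, values)            # folds to a single value: 10
list(accumulate(values, lambda acc, x: acc + x))  # every intermediate state: [1, 3, 6, 10]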
So, I have a list of binary variables and want to calculate the running duration of each period during which we are in the 1 state:
In [1]: from itertools import accumulate
import pandas as pd
drawdown_periods = [0,1,1,1,0,0,0,1,1,1,1,0,1,1,0]
Applying accumulate to this with the lambda function
lambda x,y: (x+y)*y
gives
In [2]: list(accumulate(drawdown_periods, lambda x,y: (x+y)*y))
Out[2]: [0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
counting the length of each drawdown period: the (x+y)*y trick increments the count while y is 1 and resets it to zero whenever y is 0.
Is there a smart (if quirky) way to supply a two-argument lambda function like this? I may be missing a trick here.
I know that there is a lovely recipe with groupby (see the StackOverflow questions "How to calculate consecutive Equal Values in Pandas" and "How to emulate itertools.groupby with a series/dataframe"). I'll repeat it since it's so lovely:
In [3]: df = pd.DataFrame(data=drawdown_periods, columns=['dd'])
df['dd'].groupby((df['dd'] != df['dd'].shift()).cumsum()).cumsum()
Out[3]:
0 0
1 1
2 2
3 3
4 0
5 0
6 0
7 1
8 2
9 3
10 4
11 0
12 1
13 2
14 0
Name: dd, dtype: int64
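To unpack that one-liner a little (the breakdown is mine; the recipe itself is unchanged):
runs = (df['dd'] != df['dd'].shift()).cumsum()
# runs labels each block of consecutive equal values:
# 1 2 2 2 3 3 3 4 4 4 4 5 6 6 7
df['dd'].groupby(runs).cumsum()  # the inner cumsum restarts at each new label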
This is not the solution I want, though. I need a way of passing a two-parameter lambda function to pandas-native reduce/accumulate functions, since this would also work for many other functional-programming recipes.
You could get this to work, with an efficiency penalty, using numpy. In practice, you may be better off writing ad hoc vectorized solutions.
Using np.frompyfunc:
import numpy as np

s = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0])
f = np.frompyfunc(lambda x, y: (x + y) * y, 2, 1)  # binary ufunc: 2 inputs, 1 output
f.accumulate(s.astype(object))  # object dtype is needed for a Python-level ufunc
0 0
1 1
2 2
3 3
4 0
5 0
6 0
7 1
8 2
9 3
10 4
11 0
12 1
13 2
14 0
dtype: object
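Note that the result comes back with object dtype; if you want integers again, you can cast it back:
result = f.accumulate(s.astype(object)).astype(int)  # back to a numeric Series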
What you are looking for would be a pandas method that extracts all values from a Series, converts them to Python objects, calls a Python function with them, and keeps an accumulator that is also a Python object.
This kind of behavior does not scale well when you have a lot of data, as there is a lot of time/memory overhead in wrapping the raw data in Python objects. Pandas methods try to work directly on the underlying (numpy) raw data, so they can process lots of data without having to wrap it in Python objects. The groupby + cumsum example you give is a clever way of avoiding .apply and Python functions, which would be slower.
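If you want to measure that overhead yourself, a quick comparison is easy to set up; the series length here is arbitrary, and this sketch assumes f, np and pd from the snippets above:
import timeit

big = pd.Series(np.random.randint(0, 2, size=100_000))
# groupby/cumsum stays on numpy's raw data...
t_groupby = timeit.timeit(
    lambda: big.groupby((big != big.shift()).cumsum()).cumsum(), number=10)
# ...while the object-dtype ufunc round-trips through Python objects
t_ufunc = timeit.timeit(
    lambda: f.accumulate(big.astype(object)), number=10)
print(t_groupby, t_ufunc)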
Nevertheless, you are of course free to do your own functional thing in Python if you don't care about the performance. As it's all Python anyway and there's no way of speeding it up on the pandas side, you can just write your own:
df["cev"] = list(accumulate(df.dd, lambda x,y:(x+y)*y))