
How to use `apply()` or other vectorized approach when previous value matters

Assume I have a DataFrame of the following form where the first column is a random number, and the other columns will be based on the value in the previous column.

[Image: the starting DataFrame]

For simplicity, let's say I want each number to be the previous one squared, so the result would look like this:

[Image: the DataFrame with each column equal to the previous column squared]

I know I can write a pretty simple loop to do this, but I also know looping is usually not the most efficient approach in Python/pandas. How could this be done with apply() or rolling_apply()? Or is there some other, more efficient way?

My (failed) attempts below:

In [12]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [13]: a
Out[13]: 
   0  1  2  3
0  1  0  0  0
1  2  0  0  0
2  3  0  0  0
3  4  0  0  0
4  5  0  0  0

In [14]: a = a.apply(lambda x: x**2)

In [15]: a
Out[15]: 
    0  1  2  3
0   1  0  0  0
1   4  0  0  0
2   9  0  0  0
3  16  0  0  0
4  25  0  0  0


In [16]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [17]: pandas.rolling_apply(a,1,lambda x: x**2)
C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\spyderlib\widgets\externalshell\start_ipython_kernel.py:1: FutureWarning: pd.rolling_apply is deprecated for DataFrame and will be removed in a future version, replace with 
        DataFrame.rolling(center=False,window=1).apply(args=<tuple>,kwargs=<dict>,func=<function>)
  # -*- coding: utf-8 -*-
Out[17]: 
      0    1    2    3
0   1.0  0.0  0.0  0.0
1   4.0  0.0  0.0  0.0
2   9.0  0.0  0.0  0.0
3  16.0  0.0  0.0  0.0
4  25.0  0.0  0.0  0.0

In [18]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [19]: a = a[:-1]**2

In [20]: a
Out[20]: 
    0  1  2  3
0   1  0  0  0
1   4  0  0  0
2   9  0  0  0
3  16  0  0  0


So, my issue is mostly how to refer to the previous column value in my DataFrame calculations.

Kyle asked Dec 11 '22 12:12


1 Answer

What you're describing is a recurrence relation, and I don't think there is currently any non-loop way to do that. Things like apply and rolling_apply still rely on having all the needed data available before they begin, and outputting all the result data at once at the end. That is, they don't allow you to compute the next value using earlier values of the same series. See this question and this one as well as this pandas issue.
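To make that concrete, here is a minimal sketch (not from the original answer) that records what `apply` actually hands to the function. The names `seen` and `square` are made up for illustration. Each column arrives as its own Series, so the function never gets to look at the column to its left.

```python
import pandas as pd

# Same starting frame as in the question
a = pd.DataFrame({0: [1, 2, 3, 4, 5], 1: 0, 2: 0, 3: 0})

seen = []  # hypothetical helper list: which columns were passed in

def square(col):
    # `col` is one whole column (a Series); no neighbouring column is visible here
    seen.append(col.name)
    return col ** 2

result = a.apply(square)

print(sorted(set(seen)))  # labels of the columns that were passed in
print(result)             # column 0 is squared, columns 1-3 are still all zero
```

Because columns 1 to 3 are squared from their own (zero) contents, they stay zero, which matches the output of the attempts in the question.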

In practical terms, for your example, you only have three columns you want to fill in, so doing a three-pass loop (as shown in some of the other answers) will probably not be a major performance hit.
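A minimal sketch of such a loop, assuming the columns are labelled 0 to 3 as in the question: it iterates over the three columns only, and each assignment is still vectorized down the rows. The closed-form check at the end is an addition that works only because this particular recurrence (repeated squaring) happens to reduce to column 0 raised to the power 2**k; it is not a general way around the loop.

```python
import numpy as np
import pandas as pd

# Same starting frame as in the question
a = pd.DataFrame({0: [1, 2, 3, 4, 5], 1: 0, 2: 0, 3: 0})

# One pass per column to fill; each pass squares the previous column row-wise
for col in range(1, a.shape[1]):
    a[col] = a[col - 1] ** 2

print(a)

# Special case only: for repeated squaring, column k equals column 0 ** (2 ** k),
# so broadcasting can reproduce the same table without a loop.
exponents = 2 ** np.arange(a.shape[1])             # [1, 2, 4, 8]
closed_form = a[0].to_numpy()[:, None] ** exponents
print((a.to_numpy() == closed_form).all())
```

With only three columns to fill, the Python-level loop runs three times regardless of how many rows the frame has, so the cost is dominated by the vectorized squaring.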

BrenBarn answered Dec 13 '22 21:12