Speeding up an iloc solution within a pandas dataframe

I have the following DataFrame:

import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)

# Add B and C, zero-filled except for the seed values in the first row
df.loc[:, 'B'] = 0
df.loc[:, 'C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

print(df)

Out[69]:

             A   B  C
2015-01-01   5  10  3
2015-01-02  10   0  0
2015-01-03   3   0  0
2015-01-04   4   0  0

I want to implement the following logic for the columns B and C:

  • B(k+1) = B(k) - A(k+1)
  • C(k+1) = B(k) + A(k+1)

I can do this using the following code:

for i in range(1, df.shape[0]):
    # Column positions: 0 = A, 1 = B, 2 = C
    df.iloc[i, 1] = df.iloc[i-1, 1] - df.iloc[i, 0]
    df.iloc[i, 2] = df.iloc[i-1, 1] + df.iloc[i, 0]
print(df)

This gives:

             A   B   C
2015-01-01   5  10   3
2015-01-02  10   0  20
2015-01-03   3  -3   3
2015-01-04   4  -7   1

This is the answer I'm looking for. The problem is that when I apply this loop to a large DataFrame, it runs very slowly. Is there a faster way of achieving this?

Anthony W asked Oct 17 '15 08:10



1 Answer

A trick to vectorize is to rewrite everything as cumsums.
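
As a sketch of why cumsums apply here, unrolling the recurrence from the question (writing B_k for B(k), with k = 0 as the first row) gives:

```latex
\begin{aligned}
B_k &= B_{k-1} - A_k = B_0 - \sum_{j=1}^{k} A_j, \\
C_k &= B_{k-1} + A_k = B_k + 2A_k \qquad (k \ge 1).
\end{aligned}
```

The sum over A skips the first row, which is what the shift/cumsum/shift chain computes as x in the next step. Here df["B"] is assumed to still hold its initial values (only the first row non-zero), so that df["B"].cumsum() is just B_0 repeated.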

In [11]: x = df["A"].shift(-1).cumsum().shift().fillna(0)

In [12]: x
Out[12]:
2015-01-01     0
2015-01-02    10
2015-01-03    13
2015-01-04    17
Name: A, dtype: float64

In [13]: df["B"].cumsum() - x
Out[13]:
2015-01-01    10
2015-01-02     0
2015-01-03    -3
2015-01-04    -7
dtype: float64

In [14]: df["B"].cumsum() - x + 2 * df["A"]
Out[14]:
2015-01-01    20
2015-01-02    20
2015-01-03     3
2015-01-04     1
dtype: float64

Note: The first value of C is a special case, because it is a given starting value rather than one computed from the recurrence, so you have to set it back to 3 afterwards.
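
Putting the steps above together, a minimal sketch might look like the following. The helper name fill_b_c is hypothetical (not from the original answer), and it assumes the frame has columns A, B and C with the starting values for B and C in the first row. It uses A.cumsum() minus the first A value in place of the shift/cumsum/shift chain, which should give the same running sum.

```python
import pandas as pd


def fill_b_c(df):
    """Vectorized version of the loop from the question (hypothetical helper).

    Assumes columns A, B, C, with the seed values of B and C in row 0.
    """
    out = df.copy()
    a = out["A"]
    b0 = out["B"].iloc[0]
    c0 = out["C"].iloc[0]

    # B(k) = B(0) - (A(1) + ... + A(k)); cumsum includes A(0), so remove it.
    out["B"] = b0 - (a.cumsum() - a.iloc[0])

    # C(k) = B(k-1) + A(k) = B(k) + 2 * A(k) for k >= 1.
    out["C"] = out["B"] + 2 * a

    # The first C value is a seed, not a computed value, so restore it.
    out.iloc[0, out.columns.get_loc("C")] = c0
    return out


dates = pd.date_range("20150101", periods=4)
df = pd.DataFrame(
    {"A": [5, 10, 3, 4], "B": [10, 0, 0, 0], "C": [3, 0, 0, 0]},
    index=dates,
)
result = fill_b_c(df)
print(result)
```

On the example frame this should reproduce the B and C columns produced by the loop in the question. It works on a copy, so the original df is left unchanged.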

Andy Hayden answered Oct 10 '22 23:10
