Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

looking for an efficient way to iterate

may i ask for suggestion to iterate more efficiently (and faster)? here is the issue, I am looking for a way to propagate zeros down in a pandas DataFrame within a determined window size:

import numpy as np
import pandas as pd

A = np.matrix([[ 0.,  1.,  1.,  1.,  1.],
           [ 1.,  0.,  1.,  1.,  1.],
           [ 1.,  1.,  0.,  1.,  1.],
           [ 1.,  1.,  1.,  0.,  1.],
           [ 1.,  1.,  1.,  1.,  0.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  0.],
           [ 1.,  1.,  0.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  0.,  1.],
           [ 1.,  1.,  1.,  1.,  1.],
           [ 1.,  1.,  0.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  1.],
           [ 1.,  1.,  1.,  1.,  1.]])

df = pd.DataFrame(A)

now we want to fill by increment of windows of 3 lines each the values from the top. each window of 3 lines starts on window_start, defined as:

window_size = 3
window_start = [i for i in range(0, df.shape[0]) 
                if i % window_size == 0]
print(df)
gf = df.copy()
print('\n')

now making the DataFrame where the zeros propagate from the preceding lines above inside that window:

for i in window_start:
for j in range(1, window_size):
    try: gf.iloc[i + j] = gf.iloc[i + j - 1] * gf.iloc[i + j]
    except: pass

print(gf)

This last bit is pretty inefficient and time consuming for very large datasets, is there a better way to do it?

like image 523
Cy Bu Avatar asked May 04 '17 14:05

Cy Bu


1 Answers

You should be able to accomplish this task with a cumulative product within a groupby.

df.groupby(np.arange(len(df)) // 3).cumprod()

      0    1    2    3    4
0   0.0  1.0  1.0  1.0  1.0
1   0.0  0.0  1.0  1.0  1.0
2   0.0  0.0  0.0  1.0  1.0
3   1.0  1.0  1.0  0.0  1.0
4   1.0  1.0  1.0  0.0  0.0
5   1.0  1.0  1.0  0.0  0.0
6   1.0  1.0  1.0  1.0  1.0
7   1.0  1.0  1.0  1.0  1.0
8   1.0  1.0  1.0  1.0  0.0
9   1.0  1.0  0.0  1.0  1.0
10  1.0  1.0  0.0  1.0  1.0
11  1.0  1.0  0.0  1.0  1.0
12  1.0  1.0  1.0  1.0  1.0
13  1.0  1.0  1.0  1.0  1.0
14  1.0  1.0  1.0  0.0  1.0
15  1.0  1.0  1.0  1.0  1.0
16  1.0  1.0  0.0  1.0  1.0
17  1.0  1.0  0.0  1.0  0.0
18  1.0  0.0  1.0  1.0  1.0
19  1.0  0.0  1.0  1.0  1.0

We can take a better look, by using concat to see if it's doing what we want.

pd.concat([df.iloc[:6, :2], d1.iloc[:6, :2]], axis=1, keys=['Before', 'After'])

  Before      After     
       0    1     0    1
0    0.0  1.0   0.0  1.0
1    1.0  0.0   0.0  0.0
2    1.0  1.0   0.0  0.0
3    1.0  1.0   1.0  1.0
4    1.0  1.0   1.0  1.0
5    1.0  1.0   1.0  1.0

My Take on a numpy approach
See @Divakar's solution as I borrowed some elements of his function

def prop_zero(df, window_size=3):
    a = df.values
    W = window_size
    m, n = a.shape

    pad = np.zeros((W - m % W, n))
    b = np.vstack([a, pad])

    return pd.DataFrame(
        b.reshape(-1, W, n).cumprod(1).reshape(-1, n)[:m],
        df.index, df.columns
    )

prop_zero(df)
like image 199
piRSquared Avatar answered Sep 16 '22 14:09

piRSquared