Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Forward fill column on condition [closed]

My dataframe looks like this;

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})

If col1 contains the value 1 in column 2 I want to forward fill with 1 n number of times. For example, if n = 4 then I would need the result to look like this.

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})

I think I could do this using a for loop with a counter that resets every time a condition occurs but is there a faster way to produce the same result?

Thanks!

like image 312
redbaron1981 Avatar asked Mar 22 '20 15:03

redbaron1981


People also ask

What is forward fill method?

ffill() function is used to fill the missing value in the dataframe. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. inplace : If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).

What is forward fill and backward fill pandas?

Recap of Pandas' fillna method='ffill': Ffill or forward-fill propagates the last observed non-null value forward until another non-null value is encountered. method='bfill': Bfill or backward-fill propagates the first observed non-null value backward until another non-null value is met.

What is backward fill in pandas?

bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.


3 Answers

Approach #1 : A NumPy based one with 1D convolution -

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')

A more compact one-liner -

df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')

Approach #2 : Here's one with SciPy's binary_dilation -

from scipy.ndimage.morphology import binary_dilation

N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')

Approach #3 : Squeeze out the best off NumPy with it's strided-view-based tool -

from skimage.util.shape import view_as_windows

N = 4 # window size
mask = df.Col1.values==1
w = view_as_windows(mask,N)
idx = len(df)-(N-mask[-N:].argmax())
if mask[-N:].any():
    mask[idx:idx+N-1] = 1
w[mask[:-N+1]] = 1
df['Col2'] = mask.view('i1')

Benchmarking

Setup with given sample scaled up by 10,000x -

In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
    ...:                    ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
    ...: 
    ...: df = pd.concat([df]*10000)
    ...: df.index = range(len(df.index))

Timings

# @jezrael's soln
In [68]: %%timeit
    ...: n = 3
    ...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-1 from this post
In [72]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# App-2 from this post
In [70]: %%timeit
    ...: N = 4 # window size
    ...: K = np.ones(N,dtype=bool)
    ...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# App-3 from this post
In [35]: %%timeit
    ...: N = 4 # window size
    ...: mask = df.Col1.values==1
    ...: w = view_as_windows(mask,N)
    ...: idx = len(df)-(N-mask[-N:].argmax())
    ...: if mask[-N:].any():
    ...:     mask[idx:idx+N-1] = 1
    ...: w[mask[:-N+1]] = 1
    ...: df['Col2_4'] = mask.view('i1')
1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# @yatu's soln
In [71]: %%timeit
    ...: n = 4
    ...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
    ...: df.loc[ix, 'Col2_5'] = 1
7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 144
Divakar Avatar answered Oct 13 '22 06:10

Divakar


For general solution replace non 1 values to missing values by Series.where and forward filling 1 values with limit parameter, last replace missing values by originals:

n = 3
df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)

print (df)
    Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
like image 3
jezrael Avatar answered Oct 13 '22 05:10

jezrael


Here's a NumPy based approach using np.flatnonzero to obtain the indices where Col1 is 1 and taking the broadcast sum with an arange up to n:

n = 4
ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
df.loc[ix, 'Col2'] = 1

print(df)

     Col1  Col2
0      0     0
1      1     1
2      0     1
3      0     1
4      0     1
5      0     0
6      0     0
7      0     0
8      1     1
9      0     1
10     0     1
11     0     1
12     0     0
13     0     0
14     0     0
15     0     0
16     0     0
17     1     1
18     0     1
19     0     1
20     0     1
like image 3
yatu Avatar answered Oct 13 '22 07:10

yatu