My dataframe looks like this;
df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
If col1 contains the value 1 in column 2 I want to forward fill with 1 n number of times. For example, if n = 4 then I would need the result to look like this.
df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
,'Col2':[0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})
I think I could do this using a for loop with a counter that resets every time a condition occurs but is there a faster way to produce the same result?
Thanks!
ffill() function is used to fill the missing value in the dataframe. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. inplace : If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
Recap of Pandas' fillna method='ffill': Ffill or forward-fill propagates the last observed non-null value forward until another non-null value is encountered. method='bfill': Bfill or backward-fill propagates the first observed non-null value backward until another non-null value is met.
bfill() is used to backward fill the missing values in the dataset. It will backward fill the NaN values that are present in the pandas dataframe.
Approach #1 : A NumPy based one with 1D convolution
-
N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
A more compact one-liner -
df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')
Approach #2 : Here's one with SciPy's binary_dilation
-
from scipy.ndimage.morphology import binary_dilation
N = 4 # window size
K = np.ones(N,dtype=bool)
df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
Approach #3 : Squeeze out the best off NumPy with it's strided-view-based tool -
from skimage.util.shape import view_as_windows
N = 4 # window size
mask = df.Col1.values==1
w = view_as_windows(mask,N)
idx = len(df)-(N-mask[-N:].argmax())
if mask[-N:].any():
mask[idx:idx+N-1] = 1
w[mask[:-N+1]] = 1
df['Col2'] = mask.view('i1')
Setup with given sample scaled up by 10,000x
-
In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
...: ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
...:
...: df = pd.concat([df]*10000)
...: df.index = range(len(df.index))
Timings
# @jezrael's soln
In [68]: %%timeit
...: n = 3
...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# App-1 from this post
In [72]: %%timeit
...: N = 4 # window size
...: K = np.ones(N,dtype=bool)
...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# App-2 from this post
In [70]: %%timeit
...: N = 4 # window size
...: K = np.ones(N,dtype=bool)
...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# App-3 from this post
In [35]: %%timeit
...: N = 4 # window size
...: mask = df.Col1.values==1
...: w = view_as_windows(mask,N)
...: idx = len(df)-(N-mask[-N:].argmax())
...: if mask[-N:].any():
...: mask[idx:idx+N-1] = 1
...: w[mask[:-N+1]] = 1
...: df['Col2_4'] = mask.view('i1')
1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @yatu's soln
In [71]: %%timeit
...: n = 4
...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
...: df.loc[ix, 'Col2_5'] = 1
7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For general solution replace non 1
values to missing values by Series.where
and forward filling 1
values with limit parameter, last replace missing values by originals:
n = 3
df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
print (df)
Col1 Col2
0 0 0
1 1 1
2 0 1
3 0 1
4 0 1
5 0 0
6 0 0
7 0 0
8 1 1
9 0 1
10 0 1
11 0 1
12 0 0
13 0 0
14 0 0
15 0 0
16 0 0
17 1 1
18 0 1
19 0 1
20 0 1
Here's a NumPy based approach using np.flatnonzero
to obtain the indices where Col1
is 1 and taking the broadcast sum
with an arange up to n
:
n = 4
ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
df.loc[ix, 'Col2'] = 1
print(df)
Col1 Col2
0 0 0
1 1 1
2 0 1
3 0 1
4 0 1
5 0 0
6 0 0
7 0 0
8 1 1
9 0 1
10 0 1
11 0 1
12 0 0
13 0 0
14 0 0
15 0 0
16 0 0
17 1 1
18 0 1
19 0 1
20 0 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With