Let's say I have the following DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'player': ['LBJ', 'LBJ', 'LBJ', 'Kyrie', 'Kyrie', 'LBJ', 'LBJ'],
                   'points': [25, 32, 26, 21, 29, 21, 35]})
How can I perform the opposite of ffill so that I get the following DataFrame:
df = pd.DataFrame({'player': ['LBJ', np.nan, np.nan, 'Kyrie', np.nan, 'LBJ', np.nan],
                   'points': [25, 32, 26, 21, 29, 21, 35]})
That is, I want to replace every value that repeats the one directly before it with NaN.
Here's what I have so far but I'm hoping there's a built-in pandas method or a better approach:
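# Walk the rows in order and blank each player that equals the
# most recent non-null player above it.
for i, (index, row) in enumerate(df.iterrows()):
    if i == 0:
        continue
    go_back = 1
    while True:
        # Positional lookup; stands in for the removed df.ix accessor.
        past_player = df['player'].iloc[i - go_back]
        if pd.isnull(past_player):
            go_back += 1
            continue
        if row['player'] == past_player:
            # df.at stands in for the removed DataFrame.set_value.
            df.at[index, 'player'] = np.nan
        break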
The pandas DataFrame.ffill() method fills missing values with the value from the previous row (or the previous column, if the axis parameter is set to 'columns'). 'ffill' stands for 'forward fill': it propagates the last valid observation forward.
method='ffill': forward fill propagates the last observed non-null value forward until another non-null value is encountered. method='bfill': backward fill propagates the next observed non-null value backward until another non-null value is met.
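To make the difference between the two directions concrete, here is a small sketch. The Series contents are made up for illustration and are not part of the question's data; it only uses Series.ffill and Series.bfill.

```python
import pandas as pd

# Illustrative data: None marks a missing value.
s = pd.Series(['LBJ', None, None, 'Kyrie', None], dtype=object)

forward = s.ffill()   # each gap takes the last valid value above it
backward = s.bfill()  # each gap takes the next valid value below it

print(forward.tolist())
print(backward.tolist())
```

Note that the last entry of the backward-filled result stays missing, because there is no later valid value to pull from.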
ffinv = lambda s: s.mask(s == s.shift())
df.assign(player=ffinv(df.player))
  player  points
0    LBJ      25
1    NaN      32
2    NaN      26
3  Kyrie      21
4    NaN      29
5    LBJ      21
6    NaN      35
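For anyone wondering why the shift/mask approach works, it may help to split it into its intermediate steps. This is only an unrolled sketch of the same idea; the variable names previous, is_repeat, blanked and restored are mine.

```python
import pandas as pd

df = pd.DataFrame({'player': ['LBJ', 'LBJ', 'LBJ', 'Kyrie', 'Kyrie', 'LBJ', 'LBJ'],
                   'points': [25, 32, 26, 21, 29, 21, 35]})

s = df['player']
previous = s.shift()         # each row sees the value from the row above; row 0 gets a missing value
is_repeat = s == previous    # True where a value repeats the one directly above it
blanked = s.mask(is_repeat)  # replace those repeats with NaN

print(is_repeat.tolist())
# [False, True, True, False, True, False, True]

# Forward-filling the blanked column brings the original values back.
restored = blanked.ffill()
print(restored.tolist() == s.tolist())
```

The round trip only holds when the original column has no missing values of its own, since ffill would fill those in as well.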
Probably not the most efficient solution, but one that works, is to use itertools.groupby and itertools.chain:
>>> import itertools
>>> df['player'] = list(itertools.chain.from_iterable(
...     [key] + [float('nan')] * (len(list(val)) - 1)
...     for key, val in itertools.groupby(df['player'].tolist())))
>>> df
  player  points
0    LBJ      25
1    NaN      32
2    NaN      26
3  Kyrie      21
4    NaN      29
5    LBJ      21
6    NaN      35
More specifically, this illustrates how it works:
for key, val in itertools.groupby(df['player']):
    print([key] + [float('nan')] * (len(list(val)) - 1))
giving:
['LBJ', nan, nan]
['Kyrie', nan]
['LBJ', nan]
which is then "chained" together.
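The group-then-chain steps can also be wrapped into a small helper that works on any plain list, without pandas. This is just a sketch; blank_repeats is a hypothetical name, not something from pandas or itertools.

```python
import itertools
import math

def blank_repeats(values):
    """Keep the first value of each run of equal neighbours; replace the rest with NaN."""
    runs = (
        # One list per run: the key once, then NaN for every repeat.
        [key] + [float('nan')] * (len(list(group)) - 1)
        for key, group in itertools.groupby(values)
    )
    # Flatten the per-run lists back into a single list.
    return list(itertools.chain.from_iterable(runs))

result = blank_repeats(['LBJ', 'LBJ', 'LBJ', 'Kyrie', 'Kyrie', 'LBJ', 'LBJ'])
print(result)
# ['LBJ', nan, nan, 'Kyrie', nan, 'LBJ', nan]
```

With a DataFrame you could then assign the result back with something like df['player'] = blank_repeats(df['player'].tolist()).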