We have a dataframe with three columns, as shown in the example below (df). The goal is to replace the first element of column 2 with np.nan every time the letter in column 1 changes. Since the dataset is very large, a for loop cannot be used. Any solution that involves a shift is also excluded because it is too slow.
I believe the easiest way is to use groupby and the head method; however, I don't know how to write the replacement back to the original dataframe.
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame([['A','Z',1.11],['B','Z',2.1],['C','Z',3.1],['D', 'X', 2.1], ['E','X',4.3],['E', 'X', 2.1], ['F','X',4.3]])
To select the elements that we want to change, we can do the following:
df.groupby(by=1).head(1)[2] = np.nan
However, nothing changes in the original dataframe, since the assignment is applied to the new object returned by head.
The goal is to obtain the following:
Based on the comments, we assume that df[1] never returns to a group already seen, i.e. a sequence such as ['Z', 'Z', 'X', 'Z'] is not possible.
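Under that assumption, the groupby/head idea from the question can probably be made to work by keeping only the index labels of the first rows and assigning through .loc on the original frame. This is a minimal sketch, not a benchmarked solution; the variable name first_rows is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 'Z', 1.11], ['B', 'Z', 2.1], ['C', 'Z', 3.1],
                   ['D', 'X', 2.1], ['E', 'X', 4.3], ['E', 'X', 2.1],
                   ['F', 'X', 4.3]])

# head(1) returns the first row of each group and keeps the original index labels.
first_rows = df.groupby(by=1).head(1).index

# Assign through .loc on the original frame, so the write does not land on a temporary copy.
df.loc[first_rows, 2] = np.nan
```

Note that this only marks the first occurrence of each letter, so it matches "every time the letter changes" only while groups never recur.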
Using mask and shift:
df[2] = df[2].mask(df[1].ne(df[1].shift(1)))
Using masked_array:
df[2] = np.ma.masked_array(df[2], df[1].ne(df[1].shift(1))).filled(np.nan)
# array([nan, 2.1, 3.1, nan, 4.3, 2.1, 4.3])
Using np.roll and loc:
a = df[1].values
df.loc[np.roll(a, 1) != a, 2] = np.nan
   0  1    2
0  A  Z  NaN
1  B  Z  2.1
2  C  Z  3.1
3  D  X  NaN
4  E  X  4.3
5  E  X  2.1
6  F  X  4.3
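One caveat with np.roll: it wraps around, so the first row is compared with the last row. In the example this happens to work because the first group ('Z') differs from the last ('X'), but a frame whose first and last rows share a group (for instance a single-group frame) would leave the first row unmarked. A possible sketch that avoids both shift and the wrap-around is to compare the array with itself offset by one and always flag position 0. The helper name group_start_mask is hypothetical.

```python
import numpy as np
import pandas as pd

def group_start_mask(values):
    """Return a boolean array that is True where a value differs from its predecessor.

    Position 0 is always True, since the first row starts the first group.
    """
    values = np.asarray(values)
    mask = np.empty(len(values), dtype=bool)
    if len(values):
        mask[0] = True
        mask[1:] = values[1:] != values[:-1]
    return mask

# Single-group frame: np.roll(a, 1) != a would flag nothing here.
df = pd.DataFrame([['A', 'Z', 1.11], ['B', 'Z', 2.1], ['C', 'Z', 3.1]])
df.loc[group_start_mask(df[1].values), 2] = np.nan
```

Because the comparison looks at adjacent rows only, this variant would also mark a row where a previously seen group starts again.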