I would like to create column C from column B without a for loop...
dataframe:
# | A | B | C
--+-----+----+-----
1 | 2 | 3 | 4
2 | 3 | 3 | 4
3 | 4 | 4 | 6
4 | 5 | 4 | 6
5 | 5 | 4 | 6
6 | 3 | 6 | 2
7 | 2 | 6 | 2
8 | 4 | 2 | 3 #< --- loop back around if possible (B value at index 1)
Essentially I want to get the value of the next change in B and set it to a new column C.
So far, using the answer from "Determining when a column value changes in pandas dataframe", I have:
df_filtered = df[df['B'].diff() != 0]
But after that I'm not sure how to create C without using a loop...
EDIT: @(Ayoub ZAROU)'s answer answers my original question. However, I noticed that my example dataframe doesn't cover all cases if we assume the data loops around:
# | A | B | C
--+-----+----+-----
1 | 2 | 3 | 4
2 | 3 | 3 | 4
3 | 4 | 4 | 6
4 | 5 | 4 | 6
5 | 5 | 4 | 6
6 | 3 | 6 | 2
7 | 2 | 6 | 2
8 | 4 | 2 | 3
9 | 3 | 3 | 4
10| 2 | 3 | 4
In this case, if the last segment of 3's is considered to be part of the first segment of 3's, the last two values in C will be incorrect using this solution.
An easy fix, however, is to move the last few elements to the beginning of the list (or vice versa) before computing C.
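A minimal sketch of that rotation fix, assuming B is numeric. The helper name next_change_circular is hypothetical (it is not part of pandas): it rotates B so that it starts on the first row of a segment, computes the next change there, and rotates the result back.

```python
import numpy as np
import pandas as pd


def next_change_circular(b: pd.Series) -> pd.Series:
    """Hypothetical helper: next differing value in `b`, treating `b` as circular."""
    values = b.to_numpy()
    # Positions whose value differs from the (circular) previous row, i.e. segment starts.
    starts = np.flatnonzero(values != np.roll(values, 1))
    if len(starts) == 0:
        # Assumption: a constant (or empty) column has no "next change", so return NaN.
        return pd.Series(np.nan, index=b.index)
    offset = starts[0]
    # Rotate so that row 0 is the first row of a segment; no segment straddles the ends now.
    rotated = np.roll(values, -offset)
    following = np.roll(rotated, -1)
    # Keep the following value only on the last row of each segment, then back-fill.
    filled = pd.Series(np.where(rotated != following, following, np.nan)).bfill()
    # Undo the rotation and restore the original index and dtype.
    result = np.roll(filled.to_numpy(), offset).astype(values.dtype)
    return pd.Series(result, index=b.index)


b = pd.Series([3, 3, 4, 4, 4, 6, 6, 2, 3, 3])
c = next_change_circular(b)
```

For the ten-row example this gives 4, 4, 6, 6, 6, 2, 2, 3, 4, 4, which matches column C in the table.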
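You could try the following. Note that np.roll does the same thing as shift in pandas; the only difference is that it rolls the values over, so the last position wraps around to the first value instead of becoming NaN.
In the following, c gives you the rows where B does not change on the next row: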
c = (df.B.diff(-1) == 0)
c
Out[104]:
0 True
1 False
2 True
3 True
4 False
5 True
6 False
7 False
Name: B, dtype: bool
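To illustrate the difference between shift and np.roll mentioned above, here is a small sketch on the B column from the question (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

b = pd.Series([3, 3, 4, 4, 4, 6, 6, 2], name="B")

# shift(-1) moves every value up one row; the last row has nothing to take, so it becomes NaN
shifted = b.shift(-1)

# np.roll(..., -1) does the same move, but the last row wraps around to the first value
rolled = np.roll(b.to_numpy(), -1)
```

Note that shift returns a pandas Series (upcast to float because of the NaN), while np.roll returns a plain NumPy array with the original integer dtype.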
We then set the values at the remaining rows (where a change follows) to the next value in the B column, obtained with np.roll and written with pandas.Series.where. Note that where replaces the values at the rows where c is not True:
df['C'] = np.nan
df['C'] = df.C.where(c, np.roll(df.B, -1))
df.C
Out[107]:
0 NaN
1 4.0
2 NaN
3 NaN
4 6.0
5 NaN
6 2.0
7 3.0
Name: C, dtype: float64
We then fill the remaining rows using bfill in pandas and cast the result to the B column's dtype.
So, altogether, you do:
c = (df.B.diff(-1) == 0)
df['C'] = np.nan
df['C'] = df.C.where(c, np.roll(df.B, -1)).bfill().astype(df.B.dtype)
df.C
Out[110]:
0 4
1 4
2 6
3 6
4 6
5 2
6 2
7 3
Name: C, dtype: int32
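For convenience, the same steps as a self-contained script. This is only a sketch that rebuilds the example frame from the question; the names no_change and next_b are introduced here for readability, and it assumes B is numeric so that diff works.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [2, 3, 4, 5, 5, 3, 2, 4],
    "B": [3, 3, 4, 4, 4, 6, 6, 2],
})

# True where the next row holds the same B value, i.e. no change follows this row
no_change = df["B"].diff(-1) == 0

# B moved up by one row, with the first value wrapping around to the last row
next_b = np.roll(df["B"].to_numpy(), -1)

# Start from all-NaN, write next_b where a change follows, then back-fill the gaps
c = pd.Series(np.nan, index=df.index).where(no_change, next_b)
df["C"] = c.bfill().astype(df["B"].dtype)
```

Because the last row always receives the wrapped-around value, bfill never leaves a trailing NaN and the cast back to an integer dtype is safe here.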
Another way is to get the value changes:
In [11]: changes = (df.B != df.B.shift()).cumsum()
In [12]: changes
Out[12]:
0 1
1 1
2 2
3 2
4 2
5 3
6 3
7 4
Name: B, dtype: int64
and a lookup map:
In [13]: lookup = df.B[(df.B != df.B.shift())]
In [14]: lookup.at[len(lookup)] = df.B.iloc[0]
In [15]: lookup
Out[15]:
0 3
2 4
5 6
7 2
4 3
Name: B, dtype: int64
Then use these to lookup the "next":
In [16]: lookup.iloc[changes]
Out[16]:
2 4
2 4
5 6
5 6
5 6
7 2
7 2
4 3
Name: B, dtype: int64
To create the column you need to ignore the duplicates in the index:
In [17]: df["C"] = lookup.iloc[changes].values
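One caveat with the lookup above: lookup.at[len(lookup)] adds a new label, which can collide with an existing label if a change happens to start at that row number (for example B = [1, 1, 2, 3], where the label 3 is already taken). A purely positional variant avoids this; the following is a sketch of the same idea using NumPy arrays, with names (is_start, run_id, run_values) chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [3, 3, 4, 4, 4, 6, 6, 2]})
b = df["B"].to_numpy()

# True on the first row of each run of equal values
is_start = np.r_[True, b[1:] != b[:-1]]

# 0-based run number for every row: [0, 0, 1, 1, 1, 2, 2, 3]
run_id = np.cumsum(is_start) - 1

# One value per run, then the value of the following run (wrapping around)
run_values = b[is_start]
next_values = np.roll(run_values, -1)

# Look up, for every row, the value of the run that comes next
df["C"] = next_values[run_id]
```

Like the original, this treats the first and last runs as separate segments even when they hold the same value.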