F_Date B_Date col is_B
01/09/2019 02/08/2019 2200 1
01/09/2019 03/08/2019 672 1
02/09/2019 03/08/2019 1828 1
01/09/2019 04/08/2019 503 0
02/09/2019 04/08/2019 829 1
03/09/2019 04/08/2019 1367 0
02/09/2019 05/08/2019 559 1
03/09/2019 05/08/2019 922 1
04/09/2019 05/08/2019 1519 0
01/09/2019 06/08/2019 376 1
I want to generate a column c_a such that the first entry for each F_Date starts at 25000, and each subsequent entry for that F_Date is the previous entry's c_a minus the previous col if the previous is_B was 1; otherwise the previous c_a is carried over unchanged. For example:
Expected Output :
F_Date B_Date col is_B c_a
01/09/2019 02/08/2019 2200 1 25000
01/09/2019 03/08/2019 672 1 25000 - 2200
02/09/2019 03/08/2019 1828 1 25000
01/09/2019 04/08/2019 503 0 25000 - 2200 - 672
02/09/2019 04/08/2019 829 1 25000 - 1828
03/09/2019 04/08/2019 1367 0 25000
02/09/2019 05/08/2019 559 1 25000 - 1828 - 829
03/09/2019 05/08/2019 922 1 25000 (since last value had is_B as 0)
04/09/2019 05/08/2019 1519 0 25000
01/09/2019 06/08/2019 376 1 25000 - 2200 - 672 (since last appearance had is_B as 0)
Can anyone identify a pandas way to achieve the same?
I think I have found a quite concise solution:
df['c_a'] = df.groupby('F_Date').apply(lambda grp:
    25000 - grp.col.where(grp.is_B.eq(1), 0).shift(fill_value=0).cumsum()
).reset_index(level=0, drop=True)
The result is:
F_Date B_Date col is_B c_a
0 01/09/2019 02/08/2019 2200 1 25000
1 01/09/2019 03/08/2019 672 1 22800
2 02/09/2019 03/08/2019 1828 1 25000
3 01/09/2019 04/08/2019 503 0 22128
4 02/09/2019 04/08/2019 829 1 23172
5 03/09/2019 04/08/2019 1367 0 25000
6 02/09/2019 05/08/2019 559 1 22343
7 03/09/2019 05/08/2019 922 1 25000
8 04/09/2019 05/08/2019 1519 0 25000
9 01/09/2019 06/08/2019 376 1 22128
The idea, with examples based on group F_Date == '01/09/2019':
grp.col.where(grp.is_B.eq(1), 0)
- the value to subtract from
the next row in group:
0 2200
1 672
3 0
9 376
.shift(fill_value=0)
- the value to subtract from the current
row in group:
0 0
1 2200
3 672
9 0
.cumsum()
- cumulated values to subtract:
0 0
1 2200
3 2872
9 2872
25000 - ...
- the target value:
0 25000
1 22800
3 22128
9 22128
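The four intermediate series above can be reproduced in isolation for that group (a minimal sketch; the index values 0, 1, 3, 9 are the original row positions of the '01/09/2019' rows):

```python
import pandas as pd

# Just the F_Date == '01/09/2019' rows of the example data
grp = pd.DataFrame({'col': [2200, 672, 503, 376],
                    'is_B': [1, 1, 0, 1]},
                   index=[0, 1, 3, 9])

step1 = grp.col.where(grp.is_B.eq(1), 0)  # value to subtract from the NEXT row
step2 = step1.shift(fill_value=0)         # value to subtract from the CURRENT row
step3 = step2.cumsum()                    # cumulated values to subtract
c_a = 25000 - step3                       # the target column

print(c_a.tolist())  # [25000, 22800, 22128, 22128]
```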
Nice pandas game :)
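The same idea can also be written with transform instead of apply, which keeps the original index and sidesteps the reset_index step (and the deprecation warnings that groupby-apply on grouping columns raises in recent pandas versions). Like the one-liner above, this assumes the rows within each F_Date group already appear in B_Date order:

```python
import pandas as pd

df = pd.DataFrame({
    'F_Date': ['01/09/2019', '01/09/2019', '02/09/2019', '01/09/2019', '02/09/2019',
               '03/09/2019', '02/09/2019', '03/09/2019', '04/09/2019', '01/09/2019'],
    'col':  [2200, 672, 1828, 503, 829, 1367, 559, 922, 1519, 376],
    'is_B': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
})

# zero out 'col' where is_B == 0, then cumulate the *previous* rows per F_Date
masked = df['col'].where(df['is_B'].eq(1), 0)
df['c_a'] = 25000 - masked.groupby(df['F_Date']).transform(
    lambda s: s.shift(fill_value=0).cumsum()
)

print(df['c_a'].tolist())
```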
import pandas as pd
df = pd.DataFrame({'F_Date': [pd.to_datetime(_, format='%d/%m/%Y') for _ in
['01/09/2019', '01/09/2019', '02/09/2019', '01/09/2019', '02/09/2019',
'03/09/2019', '02/09/2019', '03/09/2019', '04/09/2019', '01/09/2019']],
'B_Date': [pd.to_datetime(_, format='%d/%m/%Y') for _ in
['02/08/2019', '03/08/2019', '03/08/2019', '04/08/2019', '04/08/2019',
'04/08/2019', '05/08/2019', '05/08/2019','05/08/2019', '06/08/2019']],
'col': [2200, 672, 1828, 503, 829, 1367, 559, 922, 1519, 376],
'is_B': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
})
Let's go through it step by step:
# sort in the order that fits the semantics of your calculations
df.sort_values(['F_Date', 'B_Date'], inplace=True)
# initialize 'c_a' to 25000 if a new F_Date starts
df.loc[df['F_Date'].diff(1) != pd.Timedelta(0), 'c_a'] = 25000
# Step downwards from every 25000 and subtract the shifted 'col'
# if the shifted 'is_B' == 1; otherwise replicate the shifted 'c_a' to the next line
while pd.isna(df.c_a).any():
    df.c_a.where(
        pd.notna(df.c_a),                      # keep every non-NaN value, fill NaNs with ...
        df.c_a.shift(1).where(                 # ... the previous / shifted c_a ...
            df.is_B.shift(1) == 0,             # ... if the previous / shifted is_B == 0,
            df.c_a.shift(1) - df.col.shift(1)  # ... otherwise subtract the shifted 'col'
        ), inplace=True
    )
# restore original order
df.sort_index(inplace=True)
This is the result I get:
F_Date B_Date col is_B c_a
0 2019-09-01 2019-08-02 2200 1 25000.0
1 2019-09-01 2019-08-03 672 1 22800.0
2 2019-09-02 2019-08-03 1828 1 25000.0
3 2019-09-01 2019-08-04 503 0 22128.0
4 2019-09-02 2019-08-04 829 1 23172.0
5 2019-09-03 2019-08-04 1367 0 25000.0
6 2019-09-02 2019-08-05 559 1 22343.0
7 2019-09-03 2019-08-05 922 1 25000.0
8 2019-09-04 2019-08-05 1519 0 25000.0
9 2019-09-01 2019-08-06 376 1 22128.0
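For reference, the pieces above can be assembled into one self-contained script. The only change is using plain assignment with Series.where instead of the chained inplace=True call, which recent pandas versions warn about (and which Copy-on-Write may silently ignore):

```python
import pandas as pd

df = pd.DataFrame({
    'F_Date': pd.to_datetime(['01/09/2019', '01/09/2019', '02/09/2019', '01/09/2019',
                              '02/09/2019', '03/09/2019', '02/09/2019', '03/09/2019',
                              '04/09/2019', '01/09/2019'], format='%d/%m/%Y'),
    'B_Date': pd.to_datetime(['02/08/2019', '03/08/2019', '03/08/2019', '04/08/2019',
                              '04/08/2019', '04/08/2019', '05/08/2019', '05/08/2019',
                              '05/08/2019', '06/08/2019'], format='%d/%m/%Y'),
    'col':  [2200, 672, 1828, 503, 829, 1367, 559, 922, 1519, 376],
    'is_B': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
})

# sort in the order that fits the semantics of the calculation
df.sort_values(['F_Date', 'B_Date'], inplace=True)

# initialize 'c_a' to 25000 wherever a new F_Date starts
df.loc[df['F_Date'].diff(1) != pd.Timedelta(0), 'c_a'] = 25000

# each pass fills the rows directly below an already-filled row,
# so the loop ends after at most (largest group size - 1) iterations
while df['c_a'].isna().any():
    df['c_a'] = df['c_a'].where(
        df['c_a'].notna(),                        # keep already-filled values
        df['c_a'].shift(1).where(
            df['is_B'].shift(1) == 0,             # previous is_B == 0: carry c_a over
            df['c_a'].shift(1) - df['col'].shift(1)  # otherwise subtract previous col
        ),
    )

# restore the original row order
df.sort_index(inplace=True)
print(df['c_a'].tolist())
```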