I have a dataframe df, which can be created with this:
import datetime
import pandas as pd

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)
And looks like this:
id date1 date2 score1 score2
0 1 2016-01-01 2016-01-05 5 1
1 1 2016-01-02 2016-01-03 7 3
2 1 2016-01-03 2016-01-05 3 0
3 1 2016-01-04 2016-01-05 2 5
4 2 2016-01-02 2016-01-04 9 2
5 2 2016-01-04 2016-01-05 3 20
6 2 2016-01-03 2016-01-04 8 7
7 2 2016-01-01 2016-01-01 3 7
What I need to do is create two new columns, one each for score1 and score2, which SUM the values of score1 and score2 respectively over the rows where usedate is between date1 and date2. usedate is created by getting all dates between and including the date1 minimum and the date2 maximum. I used this to create the date range:
drange=pd.date_range(df.date1.min(),df.date2.max())
The resulting dataframe newdf should look like:
usedate score1sum score2sum
0 2016-01-01 8 8
1 2016-01-02 21 6
2 2016-01-03 32 13
3 2016-01-04 30 35
4 2016-01-05 13 26
For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 is between and including date1 and date2, which means summing row 0 (5) and row 7 (3). On usedate 2016-01-04, score2sum is 35, which is calculated by looking at the rows in df where 2016-01-04 is between and including date1 and date2, which means summing row 0 (1), row 2 (0), row 3 (5), row 4 (2), row 5 (20), and row 6 (7).
Maybe some kind of groupby, or melt then groupby?
You can use the pandas Series.between() method to select DataFrame rows between two dates. It returns a boolean Series indicating whether each element lies in the specified range; both endpoints are included by default.
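As a rough illustration of between(), here is a minimal sketch that checks which dates of the question's date range fall inside one row's date1/date2 window. The variable names drange and covered are made up for this example.

```python
import pandas as pd

# All candidate usedate values from the question, held in a Series
drange = pd.Series(pd.date_range("2016-01-01", "2016-01-05"))

# Row 1 of the question's df spans 2016-01-02 through 2016-01-03.
# between() is inclusive on both ends by default.
covered = drange.between(pd.Timestamp("2016-01-02"), pd.Timestamp("2016-01-03"))

print(covered.tolist())  # [False, True, True, False, False]
```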
The sum() method adds up the values in each column and returns one total per column. If you specify the column axis (axis='columns'), sum() instead adds across the columns and returns one total per row.
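A small sketch of the two directions, using a made-up two-row frame that borrows the question's column names:

```python
import pandas as pd

# Hypothetical frame: the two rows that cover 2016-01-01 in the question
scores = pd.DataFrame({"score1": [5, 3], "score2": [1, 7]})

# Default: one total per column
column_totals = scores.sum()

# axis="columns": one total per row
row_totals = scores.sum(axis="columns")

print(column_totals.to_dict())  # {'score1': 8, 'score2': 8}
print(row_totals.tolist())      # [6, 10]
```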
We will take a dataframe that has two date columns and compute the difference between them. Use df.dates1 - df.dates2 to find the difference between the two dates, and then convert the result to months.
There are several ways to calculate the time difference between two dates in Python using Pandas. The first is to subtract one date from the other. This returns a timedelta such as 0 days 05:00:00 that tells us the number of days, hours, minutes, and seconds between the two dates.
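For instance, subtracting two timestamps might look like the sketch below. The start and end values are invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps five hours apart
start = pd.Timestamp("2016-01-01 09:00")
end = pd.Timestamp("2016-01-01 14:00")

# Subtraction yields a pandas Timedelta
delta = end - start
print(delta)  # 0 days 05:00:00

# Convert to a plain number of hours if needed
hours = delta.total_seconds() / 3600
```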
You can use apply with a lambda function:
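# Convert both date columns to datetime64 so they compare cleanly with Timestamps
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# One row per usedate, indexed by the date itself
df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns=['score1sum', 'score2sum'])
# For each usedate (x.name), sum the scores of the rows whose [date1, date2] range contains it
df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) &
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)
# Move the date index into a regular column named usedate
newdf = df1.rename_axis('usedate').reset_index()
print(newdf)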
Output:
usedate score1sum score2sum
0 2016-01-01 8 8
1 2016-01-02 21 6
2 2016-01-03 32 13
3 2016-01-04 30 35
4 2016-01-05 13 26
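If the row-wise apply gets slow on a larger frame, one possible alternative (not part of the answer above) is to build the date-versus-row coverage mask with NumPy broadcasting and sum with a matrix product. This is only a sketch: the names days, covers and sums are made up here, and the mask needs memory proportional to len(drange) * len(df).

```python
import numpy as np
import pandas as pd

# Same data as the question, with the dates already parsed
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "date1": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04",
                             "2016-01-02", "2016-01-04", "2016-01-03", "2016-01-01"]),
    "date2": pd.to_datetime(["2016-01-05", "2016-01-03", "2016-01-05", "2016-01-05",
                             "2016-01-04", "2016-01-05", "2016-01-04", "2016-01-01"]),
    "score1": [5, 7, 3, 2, 9, 3, 8, 3],
    "score2": [1, 3, 0, 5, 2, 20, 7, 7],
})

drange = pd.date_range(df["date1"].min(), df["date2"].max())

# covers[i, j] is True when drange[i] lies within [date1[j], date2[j]]
days = drange.to_numpy()[:, None]
covers = (df["date1"].to_numpy() <= days) & (days <= df["date2"].to_numpy())

# Multiplying the 0/1 mask by the score matrix sums the covering rows per date
sums = covers.astype(int) @ df[["score1", "score2"]].to_numpy()

newdf = pd.DataFrame(sums, columns=["score1sum", "score2sum"])
newdf.insert(0, "usedate", drange)
print(newdf)
```

On the sample data this should reproduce the same five rows as the apply version.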