Python Pandas Sum Values in Columns If date between 2 dates

Tags: python, pandas, dataframe, pandas-groupby, melt

I have a dataframe df which can be created with this:

import datetime
import pandas as pd

data={'id':[1,1,1,1,2,2,2,2],
      'date1':[datetime.date(2016,1,1),datetime.date(2016,1,2),datetime.date(2016,1,3),datetime.date(2016,1,4),
               datetime.date(2016,1,2),datetime.date(2016,1,4),datetime.date(2016,1,3),datetime.date(2016,1,1)],
      'date2':[datetime.date(2016,1,5),datetime.date(2016,1,3),datetime.date(2016,1,5),datetime.date(2016,1,5),
               datetime.date(2016,1,4),datetime.date(2016,1,5),datetime.date(2016,1,4),datetime.date(2016,1,1)],
      'score1':[5,7,3,2,9,3,8,3],
      'score2':[1,3,0,5,2,20,7,7]}
df=pd.DataFrame.from_dict(data)

And looks like this:
   id       date1       date2  score1  score2
0   1  2016-01-01  2016-01-05       5       1
1   1  2016-01-02  2016-01-03       7       3
2   1  2016-01-03  2016-01-05       3       0
3   1  2016-01-04  2016-01-05       2       5
4   2  2016-01-02  2016-01-04       9       2
5   2  2016-01-04  2016-01-05       3      20
6   2  2016-01-03  2016-01-04       8       7
7   2  2016-01-01  2016-01-01       3       7

What I need to do is create two new columns, score1sum and score2sum, which sum the values of score1 and score2 respectively over the rows where usedate falls between date1 and date2 (inclusive). usedate takes every date from the date1 minimum through the date2 maximum, inclusive. I used this to create the date range:

drange=pd.date_range(df.date1.min(),df.date2.max())

The resulting dataframe newdf should look like:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26

For clarification, on usedate 2016-01-01, score1sum is 8, which is calculated by looking at the rows in df where 2016-01-01 falls between date1 and date2 (inclusive) and summing score1: row0(5) and row7(3). On usedate 2016-01-04, score2sum is 35, the sum of score2 over the rows where 2016-01-04 falls between date1 and date2 (inclusive): row0(1), row2(0), row3(5), row4(2), row5(20), row6(7).
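
The inclusive rule for a single date can be checked with a boolean mask. This is only a sketch for one usedate (2016-01-04), assuming the date columns are first converted with pd.to_datetime; the names usedate, mask, rows and total are just illustrative.

```python
import datetime
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2],
        'date1': [datetime.date(2016, 1, 1), datetime.date(2016, 1, 2),
                  datetime.date(2016, 1, 3), datetime.date(2016, 1, 4),
                  datetime.date(2016, 1, 2), datetime.date(2016, 1, 4),
                  datetime.date(2016, 1, 3), datetime.date(2016, 1, 1)],
        'date2': [datetime.date(2016, 1, 5), datetime.date(2016, 1, 3),
                  datetime.date(2016, 1, 5), datetime.date(2016, 1, 5),
                  datetime.date(2016, 1, 4), datetime.date(2016, 1, 5),
                  datetime.date(2016, 1, 4), datetime.date(2016, 1, 1)],
        'score1': [5, 7, 3, 2, 9, 3, 8, 3],
        'score2': [1, 3, 0, 5, 2, 20, 7, 7]}
df = pd.DataFrame(data)

# Assumption: work with datetime64 columns rather than datetime.date objects
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

usedate = pd.Timestamp('2016-01-04')

# True for rows whose [date1, date2] range contains usedate (both ends inclusive)
mask = (df['date1'] <= usedate) & (df['date2'] >= usedate)

rows = df.index[mask].tolist()          # row labels that contribute
total = df.loc[mask, 'score2'].sum()    # score2sum for this usedate

print(rows, total)
```

Repeating this for every date in drange gives one row of the desired newdf per date.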

Maybe some kind of groupby, or melt then groupby?

asked Jan 04 '18 21:01

clg4

1 Answer

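You can use apply with a lambda function:

# Convert the date columns to datetime64 so they compare cleanly with the DatetimeIndex
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# One row per date from the earliest date1 to the latest date2
df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])

# For each date (x.name), sum the scores of the rows whose date1..date2 range contains it
df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) &
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)

newdf = df1.rename_axis('usedate').reset_index()
print(newdf)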

Output:

     usedate  score1sum  score2sum
0 2016-01-01          8          8
1 2016-01-02         21          6
2 2016-01-03         32         13
3 2016-01-04         30         35
4 2016-01-05         13         26
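
If the row-wise apply becomes slow on larger data, one possible alternative (not part of the answer above, just a sketch) is to build the whole date-by-row mask at once with NumPy broadcasting and multiply it by the score columns. The names d, start, end, mask and sums are illustrative, and this assumes the mask of shape (number of dates, number of rows) fits in memory.

```python
import datetime
import numpy as np
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2],
        'date1': [datetime.date(2016, 1, 1), datetime.date(2016, 1, 2),
                  datetime.date(2016, 1, 3), datetime.date(2016, 1, 4),
                  datetime.date(2016, 1, 2), datetime.date(2016, 1, 4),
                  datetime.date(2016, 1, 3), datetime.date(2016, 1, 1)],
        'date2': [datetime.date(2016, 1, 5), datetime.date(2016, 1, 3),
                  datetime.date(2016, 1, 5), datetime.date(2016, 1, 5),
                  datetime.date(2016, 1, 4), datetime.date(2016, 1, 5),
                  datetime.date(2016, 1, 4), datetime.date(2016, 1, 1)],
        'score1': [5, 7, 3, 2, 9, 3, 8, 3],
        'score2': [1, 3, 0, 5, 2, 20, 7, 7]}
df = pd.DataFrame(data)
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

drange = pd.date_range(df['date1'].min(), df['date2'].max())

d = drange.to_numpy()[:, None]      # shape (n_dates, 1)
start = df['date1'].to_numpy()      # shape (n_rows,)
end = df['date2'].to_numpy()        # shape (n_rows,)

# mask[i, j] is True when date i lies within [date1, date2] of row j (inclusive)
mask = (start <= d) & (d <= end)

# Matrix product sums the scores of the matching rows for every date
sums = mask.astype(int) @ df[['score1', 'score2']].to_numpy()

newdf = pd.DataFrame({'usedate': drange,
                      'score1sum': sums[:, 0],
                      'score2sum': sums[:, 1]})
print(newdf)
```

This trades memory for speed: the mask holds one boolean per (date, row) pair, so for very long date ranges or very many rows the apply version may still be the more practical choice.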

answered Oct 08 '22 19:10

Scott Boston
