
How to do/workaround a conditional join in python Pandas?


Tags: python, join, pandas, dataframe, conditional-statements

I am trying to calculate time-based aggregations in Pandas based on date values stored in a separate table.

The top of the first table table_a looks like this:

    COMPANY_ID  DATE                 MEASURE
    1           2010-01-01 00:00:00  10
    1           2010-01-02 00:00:00  10
    1           2010-01-03 00:00:00  10
    1           2010-01-04 00:00:00  10
    1           2010-01-05 00:00:00  10

Here is the code to create the table:

    import pandas as pd

    # one row per day of 2010 for each of two companies
    table_a = pd.concat(
        [pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                       'COMPANY_ID': 1, 'MEASURE': 10}),
         pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),
                       'COMPANY_ID': 2, 'MEASURE': 10})])

The second table, table_b looks like this:

    COMPANY  END_DATE
    1        2010-03-01 00:00:00
    1        2010-06-02 00:00:00
    2        2010-03-01 00:00:00
    2        2010-06-02 00:00:00

and the code to create it is:

    table_b = pd.DataFrame(
        {'END_DATE': pd.to_datetime(['03/01/2010', '06/02/2010',
                                     '03/01/2010', '06/02/2010']),
         'COMPANY': (1, 1, 2, 2)})

I want to be able to get the sum of the measure column for each COMPANY_ID for each 30 day period prior to the END_DATE in table_b.

This is (I think) the SQL equivalent:

    select b.COMPANY, b.END_DATE, sum(a.MEASURE) AS MEASURE_TO_END_DATE
    from table_a a, table_b b
    where a.COMPANY_ID = b.COMPANY and
          a.DATE < b.END_DATE and
          a.DATE > b.END_DATE - 30
    group by b.COMPANY, b.END_DATE;

Thanks for any help


asked May 07 '14 03:05

JAB

1 Answer

Well, I can think of a few ways:

1. Essentially blow up the dataframe by merging on the exact field (company), then filter on the 30-day windows after the merge. This should be fast but could use lots of memory.

2. Move the merging and filtering on the 30-day window into a groupby(). This results in a merge for each group, so it is slower but should use less memory.

Option #1

Suppose your data looks like the following (I expanded your sample data):

    print(df)

        company       date  measure
    0         0 2010-01-01       10
    1         0 2010-01-15       10
    2         0 2010-02-01       10
    3         0 2010-02-15       10
    4         0 2010-03-01       10
    5         0 2010-03-15       10
    6         0 2010-04-01       10
    7         1 2010-03-01        5
    8         1 2010-03-15        5
    9         1 2010-04-01        5
    10        1 2010-04-15        5
    11        1 2010-05-01        5
    12        1 2010-05-15        5

    print(windows)

       company   end_date
    0        0 2010-02-01
    1        0 2010-03-15
    2        1 2010-04-01
    3        1 2010-05-15

Create a beginning date for the 30 day windows:

    import numpy as np

    # window start = window end minus 30 days
    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30,'D'))
    print(windows)

       company   end_date   beg_date
    0        0 2010-02-01 2010-01-02
    1        0 2010-03-15 2010-02-13
    2        1 2010-04-01 2010-03-02
    3        1 2010-05-15 2010-04-15

Now do a merge and then keep only the rows where date falls between beg_date and end_date (inclusive):

    # pair every row with every window of the same company, then filter
    df = df.merge(windows,on='company',how='left')
    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    print(df)

        company       date  measure   end_date   beg_date
    2         0 2010-01-15       10 2010-02-01 2010-01-02
    4         0 2010-02-01       10 2010-02-01 2010-01-02
    7         0 2010-02-15       10 2010-03-15 2010-02-13
    9         0 2010-03-01       10 2010-03-15 2010-02-13
    11        0 2010-03-15       10 2010-03-15 2010-02-13
    16        1 2010-03-15        5 2010-04-01 2010-03-02
    18        1 2010-04-01        5 2010-04-01 2010-03-02
    21        1 2010-04-15        5 2010-05-15 2010-04-15
    23        1 2010-05-01        5 2010-05-15 2010-04-15
    25        1 2010-05-15        5 2010-05-15 2010-04-15

You can compute the 30 day window sums by grouping on company and end_date:

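    # select measure explicitly: recent pandas versions refuse to sum the datetime columns
    print(df.groupby(['company','end_date'])[['measure']].sum())

                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15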
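For reference, the Option #1 steps can be collected into one self-contained function. This is only a sketch: the helper name `window_sums` and its `days` parameter are made up for illustration, and the sample frames are rebuilt from the printed output above.

```python
import pandas as pd

# Sample data rebuilt from the frames printed above.
df = pd.DataFrame({
    'company': [0] * 7 + [1] * 6,
    'date': pd.to_datetime(
        ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(
        ['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15']),
})


def window_sums(data, windows, days=30):
    """Hypothetical helper: sum measure over [end_date - days, end_date]."""
    w = windows.assign(beg_date=windows['end_date'] - pd.Timedelta(days=days))
    # Pair every row with every window of the same company ("blow up").
    merged = data.merge(w, on='company', how='inner')
    # Keep rows whose date lies inside the window (both ends inclusive).
    in_window = merged['date'].between(merged['beg_date'], merged['end_date'])
    return merged[in_window].groupby(['company', 'end_date'])['measure'].sum()


result = window_sums(df, windows)
print(result)
```

Note that a window containing no rows simply does not appear in the result; reindexing against `windows` afterwards would be one way to get explicit zeros.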

Option #2

Move all merging into a groupby. This should be better on memory but I would think much slower:

    # df here is the original frame, before the Option #1 merge
    windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
                           np.timedelta64(30,'D'))

    def cond_merge(g,windows):
        # merge and filter one company's rows at a time
        g = g.merge(windows,on='company',how='left')
        g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
        return g.groupby('end_date')['measure'].sum()

    print(df.groupby('company').apply(cond_merge,windows))

    company  end_date
    0        2010-02-01    20
             2010-03-15    30
    1        2010-04-01    10
             2010-05-15    15
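The same per-group idea can also be written as a plain loop, which sidesteps `groupby().apply` (its handling of the grouping column has changed between pandas versions). This is a sketch of a variant, not the code above: it replaces the per-group merge with a boolean mask per window, and the variable names are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the frames printed above.
df = pd.DataFrame({
    'company': [0] * 7 + [1] * 6,
    'date': pd.to_datetime(
        ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(
        ['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15']),
})

keys, sums = [], []
for company, g in df.groupby('company'):
    # Only this company's windows are looked at, so nothing is blown up.
    ends = windows.loc[windows['company'] == company, 'end_date']
    for end in ends:
        beg = end - pd.Timedelta(days=30)
        mask = (g['date'] >= beg) & (g['date'] <= end)
        keys.append((company, end))
        sums.append(g.loc[mask, 'measure'].sum())

index = pd.MultiIndex.from_tuples(keys, names=['company', 'end_date'])
result = pd.Series(sums, index=index, name='measure')
print(result)
```

Unlike the merge-based versions, an empty window shows up here with a sum of 0 rather than being dropped.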

Another option

Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:

    windows['date'] = windows['end_date']

    # the outer merge keeps every data row and adds end_date where the date matches
    df = df.merge(windows,on=['company','date'],how='outer')
    print(df)

        company       date  measure   end_date
    0         0 2010-01-01       10        NaT
    1         0 2010-01-15       10        NaT
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10        NaT
    4         0 2010-03-01       10        NaT
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5        NaT
    8         1 2010-03-15        5        NaT
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5        NaT
    11        1 2010-05-01        5        NaT
    12        1 2010-05-15        5 2010-05-15

This merge essentially inserts your window end dates into the dataframe. Backfilling the end dates (by group) then gives you a structure from which you can easily create your summation windows:

    # backfill within each company; GroupBy.bfill keeps the original row index
    df['end_date'] = df.groupby('company')['end_date'].bfill()

    print(df)

        company       date  measure   end_date
    0         0 2010-01-01       10 2010-02-01
    1         0 2010-01-15       10 2010-02-01
    2         0 2010-02-01       10 2010-02-01
    3         0 2010-02-15       10 2010-03-15
    4         0 2010-03-01       10 2010-03-15
    5         0 2010-03-15       10 2010-03-15
    6         0 2010-04-01       10        NaT
    7         1 2010-03-01        5 2010-04-01
    8         1 2010-03-15        5 2010-04-01
    9         1 2010-04-01        5 2010-04-01
    10        1 2010-04-15        5 2010-05-15
    11        1 2010-05-01        5 2010-05-15
    12        1 2010-05-15        5 2010-05-15

    # rows after the last window have no end date; drop them
    df = df[df.end_date.notnull()]
    df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
                      np.timedelta64(30,'D'))

    print(df)

        company       date  measure   end_date   beg_date
    0         0 2010-01-01       10 2010-02-01 2010-01-02
    1         0 2010-01-15       10 2010-02-01 2010-01-02
    2         0 2010-02-01       10 2010-02-01 2010-01-02
    3         0 2010-02-15       10 2010-03-15 2010-02-13
    4         0 2010-03-01       10 2010-03-15 2010-02-13
    5         0 2010-03-15       10 2010-03-15 2010-02-13
    7         1 2010-03-01        5 2010-04-01 2010-03-02
    8         1 2010-03-15        5 2010-04-01 2010-03-02
    9         1 2010-04-01        5 2010-04-01 2010-03-02
    10        1 2010-04-15        5 2010-05-15 2010-04-15
    11        1 2010-05-01        5 2010-05-15 2010-04-15
    12        1 2010-05-15        5 2010-05-15 2010-04-15

    df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
    # select measure explicitly: recent pandas versions refuse to sum the datetime columns
    print(df.groupby(['company','end_date'])[['measure']].sum())

                        measure
    company end_date
    0       2010-02-01       20
            2010-03-15       30
    1       2010-04-01       10
            2010-05-15       15
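In the non-overlapping case, `pd.merge_asof` with `direction='forward'` can stand in for the outer merge plus backfill, since it attaches the next end date to each row directly. This swaps in a different pandas function from the one used above, so treat it as a sketch; the sample frames are rebuilt from the printed output.

```python
import pandas as pd

# Sample data rebuilt from the frames printed above.
df = pd.DataFrame({
    'company': [0] * 7 + [1] * 6,
    'date': pd.to_datetime(
        ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(
        ['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15']),
})

# merge_asof requires both frames to be sorted by the `on` key.
left = df.sort_values('date')
right = windows.assign(date=windows['end_date']).sort_values('date')

# For each row, find the first end_date >= date within the same company.
tagged = pd.merge_asof(left, right, on='date', by='company',
                       direction='forward')

# Rows after the last window get NaT; rows more than 30 days early are cut.
tagged = tagged.dropna(subset=['end_date'])
in_window = tagged['date'] >= tagged['end_date'] - pd.Timedelta(days=30)
result = tagged[in_window].groupby(['company', 'end_date'])['measure'].sum()
print(result)
```

Because each row is tagged with exactly one end date, this only gives correct sums when a company's windows do not overlap.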

Another alternative is to resample your first dataframe to daily data, compute rolling sums with a 30 day window, and then select the end dates that you are interested in. This could be quite memory intensive too.
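A rough sketch of that resample-and-roll idea follows. The per-company loop and the choice of `closed='both'` (so the window covers end_date - 30 days through end_date inclusive, matching the options above) are assumptions of this sketch, not something spelled out in the answer.

```python
import pandas as pd

# Sample data rebuilt from the frames printed above.
df = pd.DataFrame({
    'company': [0] * 7 + [1] * 6,
    'date': pd.to_datetime(
        ['2010-01-01', '2010-01-15', '2010-02-01', '2010-02-15',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-03-01', '2010-03-15', '2010-04-01',
         '2010-04-15', '2010-05-01', '2010-05-15']),
    'measure': [10] * 7 + [5] * 6,
})
windows = pd.DataFrame({
    'company': [0, 0, 1, 1],
    'end_date': pd.to_datetime(
        ['2010-02-01', '2010-03-15', '2010-04-01', '2010-05-15']),
})

companies, parts = [], []
for company, g in df.groupby('company'):
    # Daily series; days without data contribute 0 to the sums.
    daily = g.set_index('date')['measure'].resample('D').sum()
    # Time-based window covering [t - 30 days, t], both ends included.
    rolled = daily.rolling('30D', closed='both').sum()
    # Pick out the rolling sum at each window end date for this company.
    ends = windows.loc[windows['company'] == company, 'end_date']
    parts.append(rolled.reindex(pd.DatetimeIndex(ends)))
    companies.append(company)

result = pd.concat(parts, keys=companies, names=['company', 'end_date'])
print(result)
```

End dates outside a company's observed date range come back as NaN here, and the rolling sums are floats rather than integers.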


answered Oct 04 '22 14:10

Karl D.
