Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do/workaround a conditional join in python Pandas?

I am trying to calculate time based aggregations in Pandas based on date values stored in a separate tables.

The top of the first table table_a looks like this:

    COMPANY_ID  DATE            MEASURE     1   2010-01-01 00:00:00     10     1   2010-01-02 00:00:00     10     1   2010-01-03 00:00:00     10     1   2010-01-04 00:00:00     10     1   2010-01-05 00:00:00     10 

Here is the code to create the table:

    table_a = pd.concat(\     [pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\     'COMPANY_ID': 1 , 'MEASURE': 10}),\     pd.DataFrame({'DATE': pd.date_range("01/01/2010", "12/31/2010", freq="D"),\     'COMPANY_ID': 2 , 'MEASURE': 10})]) 

The second table, table_b looks like this:

        COMPANY     END_DATE         1   2010-03-01 00:00:00         1   2010-06-02 00:00:00         2   2010-03-01 00:00:00         2   2010-06-02 00:00:00 

and the code to create it is:

    table_b = pd.DataFrame({'END_DATE':pd.to_datetime(['03/01/2010','06/02/2010','03/01/2010','06/02/2010']),\                     'COMPANY':(1,1,2,2)}) 

I want to be able to get the sum of the measure column for each COMPANY_ID for each 30 day period prior to the END_DATE in table_b.

This is (I think) the SQL equivalent:

      select  b.COMPANY_ID,  b.DATE  sum(a.MEASURE) AS MEASURE_TO_END_DATE  from table_a a, table_b b  where a.COMPANY = b.COMPANY and        a.DATE < b.DATE and        a.DATE > b.DATE - 30    group by b.COMPANY; 

Thanks for any help

like image 827
JAB Avatar asked May 07 '14 03:05

JAB


People also ask

How do you use between conditions in Pandas?

Boolean Series in PandasThe between() function is used to get boolean Series equivalent to left <= series <= right. This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

How do you cross join in Python?

In Pandas, there are parameters to perform left, right, inner or outer merge and join on two DataFrames or Series. However there's no possibility as of now to perform a cross join to merge or join two methods using how="cross" parameter. # merge on that key.

How do I apply if else condition for a column in Pandas?

Use DataFrame. apply() to Apply the if-else Condition in a Pandas DataFrame in Python. The apply() method uses the data frame's axis (row or column) to apply a function. We can make our defined function that consists of if-else conditions and apply it to the Pandas dataframe.


1 Answers

Well, I can think of a few ways:

  1. essentially blow up the dataframe by just merging on the exact field (company)... then filter on the 30-day windows after the merge.
  • should be fast but could use lots of memory
  1. Move the merging and filtering on the 30-day window into a groupby().
  • results in a merge for each group, so slower but should use less memory

Option #1

Suppose your data looks like the following (I expanded your sample data):

print df      company       date  measure 0         0 2010-01-01       10 1         0 2010-01-15       10 2         0 2010-02-01       10 3         0 2010-02-15       10 4         0 2010-03-01       10 5         0 2010-03-15       10 6         0 2010-04-01       10 7         1 2010-03-01        5 8         1 2010-03-15        5 9         1 2010-04-01        5 10        1 2010-04-15        5 11        1 2010-05-01        5 12        1 2010-05-15        5  print windows     company   end_date 0        0 2010-02-01 1        0 2010-03-15 2        1 2010-04-01 3        1 2010-05-15 

Create a beginning date for the 30 day windows:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -                        np.timedelta64(30,'D')) print windows     company   end_date   beg_date 0        0 2010-02-01 2010-01-02 1        0 2010-03-15 2010-02-13 2        1 2010-04-01 2010-03-02 3        1 2010-05-15 2010-04-15 

Now do a merge and then select based on if date falls within beg_date and end_date:

df = df.merge(windows,on='company',how='left') df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)] print df      company       date  measure   end_date   beg_date 2         0 2010-01-15       10 2010-02-01 2010-01-02 4         0 2010-02-01       10 2010-02-01 2010-01-02 7         0 2010-02-15       10 2010-03-15 2010-02-13 9         0 2010-03-01       10 2010-03-15 2010-02-13 11        0 2010-03-15       10 2010-03-15 2010-02-13 16        1 2010-03-15        5 2010-04-01 2010-03-02 18        1 2010-04-01        5 2010-04-01 2010-03-02 21        1 2010-04-15        5 2010-05-15 2010-04-15 23        1 2010-05-01        5 2010-05-15 2010-04-15 25        1 2010-05-15        5 2010-05-15 2010-04-15 

You can compute the 30 day window sums by grouping on company and end_date:

print df.groupby(['company','end_date']).sum()                      measure company end_date            0       2010-02-01       20         2010-03-15       30 1       2010-04-01       10         2010-05-15       15 

Option #2 Move all merging into a groupby. This should be better on memory but I would think much slower:

windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -                        np.timedelta64(30,'D'))  def cond_merge(g,windows):     g = g.merge(windows,on='company',how='left')     g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]     return g.groupby('end_date')['measure'].sum()  print df.groupby('company').apply(cond_merge,windows)  company  end_date   0        2010-02-01    20          2010-03-15    30 1        2010-04-01    10          2010-05-15    15 

Another option Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:

windows['date'] = windows['end_date']  df = df.merge(windows,on=['company','date'],how='outer') print df      company       date  measure   end_date 0         0 2010-01-01       10        NaT 1         0 2010-01-15       10        NaT 2         0 2010-02-01       10 2010-02-01 3         0 2010-02-15       10        NaT 4         0 2010-03-01       10        NaT 5         0 2010-03-15       10 2010-03-15 6         0 2010-04-01       10        NaT 7         1 2010-03-01        5        NaT 8         1 2010-03-15        5        NaT 9         1 2010-04-01        5 2010-04-01 10        1 2010-04-15        5        NaT 11        1 2010-05-01        5        NaT 12        1 2010-05-15        5 2010-05-15 

This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:

df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())  print df      company       date  measure   end_date 0         0 2010-01-01       10 2010-02-01 1         0 2010-01-15       10 2010-02-01 2         0 2010-02-01       10 2010-02-01 3         0 2010-02-15       10 2010-03-15 4         0 2010-03-01       10 2010-03-15 5         0 2010-03-15       10 2010-03-15 6         0 2010-04-01       10        NaT 7         1 2010-03-01        5 2010-04-01 8         1 2010-03-15        5 2010-04-01 9         1 2010-04-01        5 2010-04-01 10        1 2010-04-15        5 2010-05-15 11        1 2010-05-01        5 2010-05-15 12        1 2010-05-15        5 2010-05-15  df = df[df.end_date.notnull()] df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -                    np.timedelta64(30,'D'))  print df     company       date  measure   end_date   beg_date 0         0 2010-01-01       10 2010-02-01 2010-01-02 1         0 2010-01-15       10 2010-02-01 2010-01-02 2         0 2010-02-01       10 2010-02-01 2010-01-02 3         0 2010-02-15       10 2010-03-15 2010-02-13 4         0 2010-03-01       10 2010-03-15 2010-02-13 5         0 2010-03-15       10 2010-03-15 2010-02-13 7         1 2010-03-01        5 2010-04-01 2010-03-02 8         1 2010-03-15        5 2010-04-01 2010-03-02 9         1 2010-04-01        5 2010-04-01 2010-03-02 10        1 2010-04-15        5 2010-05-15 2010-04-15 11        1 2010-05-01        5 2010-05-15 2010-04-15 12        1 2010-05-15        5 2010-05-15 2010-04-15  df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)] print df.groupby(['company','end_date']).sum()                      measure company end_date            0       2010-02-01       20         2010-03-15       30 1       2010-04-01       10         2010-05-15       15 

Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.

like image 165
Karl D. Avatar answered Oct 04 '22 14:10

Karl D.