Pandas: Transform and merge a date interval into a dummy variable in a panel

I have two dataframes, the main one is a monthly (MS) panel like this:

df = pd.DataFrame({'Location':['A', 'A', 'B', 'B'],
                   'Date':pd.to_datetime(['1990-1-1', '1990-2-1']*2, yearfirst=True)})

        Date Location
0 1990-01-01        A
1 1990-02-01        A
2 1990-01-01        B
3 1990-02-01        B

The second one is a list of events that includes locations, start date and end date (month first), like this:

events = pd.DataFrame({'Location':['A', 'B'], 
                   'Start Date':pd.to_datetime(['1/14/1990', '1/2/1990']), 
                   'End Date':pd.to_datetime(['1/15/1990', '2/13/1990'])})

  Location Start Date   End Date
0        A  1990-01-14  1990-01-15
1        B  1990-01-02  1990-02-13

What I need is to turn the start-and-end-date/location combos in the second dataframe into dummy variables in the first. In other words, I need a column that takes on the value of 1 if a particular location had an event on a given date, 0 otherwise. Like this:

        Date Location  Event
0 1990-01-01        A      1
1 1990-02-01        A      0
2 1990-01-01        B      1
3 1990-02-01        B      1

As you can see, no event at location A covers February 1990, so the row for 1990-02-01 at location A is 0. Some events span multiple months, others do not. The day of the event within the month is irrelevant, since the main data is all MS frequency: an event counts for every month it touches. It's a large panel, so the same location will have events on many different dates, and the same date will have events in different locations.


The solution I've worked out is messy and not very fast:

events2 = pd.melt(events, id_vars='Location', 
                          value_vars=['Start Date', 'End Date'],
                          value_name='Event')

import datetime
def date_fill(g):
    #to make sure the 1st of a month is always in the range
    y, m = g['Event'].min().year, g['Event'].min().month
    date_range = pd.date_range(datetime.datetime(year=y, month=m, day=1),
                               g['Event'].max(),
                               freq='MS')
    return g.set_index('Event').reindex(date_range,
                                        fill_value=g['Location'].iloc[0])

events3 = events2.groupby('Location', as_index=False).apply(lambda g: date_fill(g))

Which gives me this:

             Location variable
0 1990-01-01        A        A
1 1990-01-01        B        B
  1990-02-01        B        B

Which I can then clean up a bit, create a column of all 1s, and left-merge into the first dataframe on location and date, filling NaNs with 0. It works, but it's obviously messy and slow (a smaller consideration than messy because the data isn't overly large). I feel like there must be a better way, but I haven't turned it up yet.

Edit: There are actually several problems with my "solution" also, as I explore this more, which was my fear with such a messy bit of work. Specifically it chokes on some corner cases, like when the event starts and ends on the 1st of the month (can't reindex with duplicates).
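For reference, the duplicate-index corner case might be avoided by never reindexing at all. The following is only a minimal sketch under assumptions: it expands each event into the months it touches with pd.period_range and collects (location, month) pairs in a set, so repeated months collapse on their own. The name covered and the third event (starting and ending on the 1st) are hypothetical additions for illustration, and the sample frames are rebuilt with ISO date strings so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Location': ['A', 'A', 'B', 'B'],
                   'Date': pd.to_datetime(['1990-01-01', '1990-02-01'] * 2)})

# The third row is a hypothetical event that starts and ends on the 1st,
# i.e. the corner case that broke the reindex-based attempt.
events = pd.DataFrame({
    'Location': ['A', 'B', 'A'],
    'Start Date': pd.to_datetime(['1990-01-14', '1990-01-02', '1990-01-01']),
    'End Date': pd.to_datetime(['1990-01-15', '1990-02-13', '1990-01-01']),
})

# Every (location, month) pair touched by at least one event.
# A set silently absorbs duplicates from overlapping events.
covered = {
    (loc, month)
    for loc, start, end in zip(events['Location'],
                               events['Start Date'],
                               events['End Date'])
    for month in pd.period_range(start, end, freq='M')
}

# Look up each panel row by its location and monthly period.
panel_keys = zip(df['Location'], df['Date'].dt.to_period('M'))
df['Event'] = [int(key in covered) for key in panel_keys]
print(df)
```

The lookup is still a Python-level loop, so this trades speed for simplicity; it leaves both input frames otherwise unmodified.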

asked Dec 19 '25 04:12 by Jeff


1 Answer

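This should produce the desired output. It is not fast, because it calls a Python function for every row of the panel. It converts all dates to monthly periods, then checks each row against the events for its location.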

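import pandas as pd

# Compare at month granularity: convert every date to a monthly period.
df["Date"] = df["Date"].dt.to_period('M')
events["Start Date"] = events["Start Date"].dt.to_period('M')
events["End Date"] = events["End Date"].dt.to_period('M')
e_g = events.groupby("Location")

def f(x):
    # A location with no events at all never matches.
    if x.Location not in e_g.groups:
        return False
    g = e_g.get_group(x.Location)
    # True if the row's month lies inside any event interval for its location.
    return ((x.Date >= g["Start Date"]) & (x.Date <= g["End Date"])).any()

df["dummy"] = df.apply(f, axis=1).astype(int)
df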

    Date    Location  dummy
0   1990-01     A       1
1   1990-02     A       0
2   1990-01     B       1
3   1990-02     B       1
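
If the row-wise apply above becomes a bottleneck, a vectorized variant may help. The sketch below is an alternative, not a tested drop-in: it merges the panel with the events on Location and compares monthly periods column-wise. The names left, ev, pairs, hit, row and month are made up for illustration, the sample frames are rebuilt with ISO date strings so the block runs on its own, and it assumes the panel's index labels are unique.

```python
import pandas as pd

df = pd.DataFrame({'Location': ['A', 'A', 'B', 'B'],
                   'Date': pd.to_datetime(['1990-01-01', '1990-02-01'] * 2)})
events = pd.DataFrame({
    'Location': ['A', 'B'],
    'Start Date': pd.to_datetime(['1990-01-14', '1990-01-02']),
    'End Date': pd.to_datetime(['1990-01-15', '1990-02-13']),
})

# Carry the original row label through the merge so results can be mapped back.
left = df[['Location']].assign(row=df.index,
                               month=df['Date'].dt.to_period('M'))

# Event intervals at month granularity.
ev = pd.DataFrame({
    'Location': events['Location'],
    'start': events['Start Date'].dt.to_period('M'),
    'end': events['End Date'].dt.to_period('M'),
})

# One row per (panel row, event at the same location).
pairs = left.merge(ev, on='Location', how='left')

# Column-wise interval test instead of a per-row Python function.
hit = (pairs['month'] >= pairs['start']) & (pairs['month'] <= pairs['end'])

# A panel row is flagged if any event at its location covers its month.
flag = hit.groupby(pairs['row']).any()
df['Event'] = flag.reindex(df.index, fill_value=False).astype(int)
print(df)
```

The merge produces one row for every panel row and event pair at the same location, so memory grows with the number of events per location; for locations with very many events the row-wise version or a chunked merge may be the safer choice.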
answered Dec 21 '25 16:12 by Tai


