Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shift time in multi-index to merge

I want to merge two datasets that are indexed by time and id. The problem is, the time is slightly different in each dataset. In one dataset, the time (Monthly) is mid-month, so the 15th of every month. In the other dataset, it is the last business day. This should still be a one-to-one match, but the dates are not exactly the same.

My approach is to shift mid-month dates to business day end-of-month dates.

Data:

dt = pd.date_range('1/1/2011','12/31/2011', freq='D')
dt = dt[dt.day == 15]
lst = [1,2,3]
idx = pd.MultiIndex.from_product([dt,lst],names=['date','id'])
df = pd.DataFrame(np.random.randn(len(idx)), index=idx)
df.head()

output:

                      0
date       id
2011-01-15 1  -0.598584
           2  -0.484455
           3  -2.044912
2011-02-15 1  -0.017512
           2   0.852843

This is what I want (I removed the performance warning):

In[83]:df.index.levels[0] + BMonthEnd()
Out[83]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

However, indexes are immutable, so this does not work:

In: df.index.levels[0] = df.index.levels[0] + BMonthEnd()

TypeError: 'FrozenList' does not support mutable operations.

The only solution I've got is to reset_index(), change the dates, then set_index() again:

df.reset_index(inplace=True)
df['date'] = df['date'] + BMonthEnd()
df.set_index(['date','id'], inplace=True)

This gives what I want, but is this the best way? Is there a set_level_values() function (I didn't see it in the API)?

Or maybe I'm taking the wrong approach to the merge. I could merge the dataset with keys df.index.get_level_values(0).year, df.index.get_level_values(0).month and id but this doesn't seem much better.

like image 669
Jesse Blocher Avatar asked May 30 '18 15:05

Jesse Blocher


2 Answers

You can use set_levels in order to set multiindex levels:

df.index.set_levels(df.index.levels[0] + pd.tseries.offsets.BMonthEnd(),
                    level='date', inplace=True)


>>> df.head()
                      0
date       id          
2011-01-31 1  -1.410646
           2   0.642618
           3  -0.537930
2011-02-28 1  -0.418943
           2   0.983186
like image 70
sacuL Avatar answered Sep 30 '22 12:09

sacuL


You could just build it again:

df.index = pd.MultiIndex.from_arrays(
    [
       df.index.get_level_values(0) + BMonthEnd(),
       df.index.get_level_values(1)
])

set_levels implicitly rebuilds the index under the covers. If you have more than two levels, this solution becomes unweildy, so consider using set_levels for typing brevity.

like image 44
cs95 Avatar answered Sep 30 '22 10:09

cs95