
Pandas: decompress date range to individual dates

Dataset: I have a 1GB dataset of stocks, each with one value per date range. The date ranges do not overlap, and the dataset is sorted on (ticker, start_date).

>>> df.head()
             start_date    end_date                   val    
ticker         
AAPL         2014-05-01  2014-05-01         10.0000000000
AAPL         2014-06-05  2014-06-10         20.0000000000
GOOG         2014-06-01  2014-06-15         50.0000000000
MSFT         2014-06-16  2014-06-16                  None
TWTR         2014-01-17  2014-05-17         10.0000000000

Goal: I want to decompress the dataframe so that I have individual dates instead of date ranges. For example, the AAPL rows would go from being only 2 rows to 7 rows:

>>> AAPL_decompressed.head()
                   val
date                       
2014-05-01         10.0000000000
2014-06-05         20.0000000000
2014-06-06         20.0000000000
2014-06-07         20.0000000000
2014-06-08         20.0000000000

I'm hoping there's a nice, optimized pandas method like resample that can do this in a couple of lines.

asked Jun 05 '14 by Kamil Sindi

2 Answers

A bit more than a few lines, but I think it results in what you asked:

Starting with your dataframe:

In [70]: df
Out[70]:
       start_date   end_date  val  row
ticker
AAPL   2014-05-01 2014-05-01   10    0
AAPL   2014-06-05 2014-06-10   20    1
GOOG   2014-06-01 2014-06-15   50    2
MSFT   2014-06-16 2014-06-16  NaN    3
TWTR   2014-01-17 2014-05-17   10    4

First, I reshape this dataframe into one with a single date column, so every row is repeated twice (once for its start_date and once for its end_date), and I add a counter column called row:

In [60]: df['row'] = range(len(df))
In [61]: starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
In [62]: ends = df[['end_date', 'val', 'row']].rename(columns={'end_date':'date'})
In [63]: df_decomp = pd.concat([starts, ends])
In [64]: df_decomp = df_decomp.set_index('row', append=True)
In [65]: df_decomp.sort_index()
Out[65]:
                 date  val
ticker row
AAPL   0   2014-05-01   10
       0   2014-05-01   10
       1   2014-06-05   20
       1   2014-06-10   20
GOOG   2   2014-06-01   50
       2   2014-06-15   50
MSFT   3   2014-06-16  NaN
       3   2014-06-16  NaN
TWTR   4   2014-01-17   10
       4   2014-05-17   10

Based on this new dataframe, I can group it by ticker and row, apply a daily resample to each of these groups, and forward fill the values (fillna with method 'pad'):

In [66]: df_decomp = df_decomp.groupby(level=[0,1]).apply(lambda x: x.set_index('date').resample('D').fillna(method='pad'))

In [67]: df_decomp = df_decomp.reset_index(level=1, drop=True)

The last command drops the now-superfluous row index level.
Accessing the AAPL rows then gives your desired output:

In [69]: df_decomp.loc['AAPL']
Out[69]:
            val
date
2014-05-01   10
2014-06-05   20
2014-06-06   20
2014-06-07   20
2014-06-08   20
2014-06-09   20
2014-06-10   20
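
For reference, here is the whole pipeline condensed into one self-contained script. This is only a sketch: it assumes a recent pandas version, rebuilds the question's sample data by hand, and uses Resampler.ffill() in place of the older fillna(method='pad') spelling.

import pandas as pd

# Rebuild the sample data from the question (the MSFT value is missing).
df = pd.DataFrame(
    {'ticker': ['AAPL', 'AAPL', 'GOOG', 'MSFT', 'TWTR'],
     'start_date': pd.to_datetime(['2014-05-01', '2014-06-05', '2014-06-01',
                                   '2014-06-16', '2014-01-17']),
     'end_date': pd.to_datetime(['2014-05-01', '2014-06-10', '2014-06-15',
                                 '2014-06-16', '2014-05-17']),
     'val': [10.0, 20.0, 50.0, None, 10.0]}
).set_index('ticker')

# Stack start_date and end_date into a single 'date' column, keeping a row counter.
df['row'] = range(len(df))
starts = df[['start_date', 'val', 'row']].rename(columns={'start_date': 'date'})
ends = df[['end_date', 'val', 'row']].rename(columns={'end_date': 'date'})
df_decomp = pd.concat([starts, ends]).set_index('row', append=True)

# Resample each (ticker, row) group to daily frequency, forward fill val,
# then drop the helper 'row' index level again.
df_decomp = (df_decomp
             .groupby(level=[0, 1])
             .apply(lambda g: g.set_index('date').resample('D').ffill())
             .reset_index(level=1, drop=True))

print(df_decomp.loc['AAPL'])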
answered by joris


I think you can do this in five steps:

1) filter the ticker column to find the stock you want

2) use pandas.bdate_range to build a business-day DatetimeIndex for each (start, end) pair

3) flatten this list of indexes using reduce

4) reindex your new filtered dataframe on the flattened index

5) fill NaNs using the pad method (forward fill)

Here is the code:

>>> import pandas as pd
>>> import datetime

>>> data = [('AAPL', datetime.date(2014, 4, 28), datetime.date(2014, 5, 2), 90),
            ('AAPL', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 80),
            ('MSFT', datetime.date(2014, 5, 5), datetime.date(2014, 5, 9), 150),
            ('AAPL', datetime.date(2014, 5, 12), datetime.date(2014, 5, 16), 85)]
>>> df = pd.DataFrame(data=data, columns=['ticker', 'start', 'end', 'val'])

>>> df_new = df[df['ticker'] == 'AAPL'].copy()
>>> df_new.index = pd.to_datetime(df_new['start'])
>>> df_new.index.name = 'date'

>>> from functools import reduce  # reduce is a builtin on Python 2 but must be imported on Python 3
>>> new_index = [pd.bdate_range(**d) for d in df_new[['start', 'end']].to_dict('records')]
>>> new_index_flat = reduce(pd.DatetimeIndex.append, new_index)

>>> df_new = df_new.reindex(new_index_flat)
>>> df_new = df_new.fillna(method='pad')
>>> df_new
>>> df_new
               ticker       start         end  val
    2014-04-28   AAPL  2014-04-28  2014-05-02   90
    2014-04-29   AAPL  2014-04-28  2014-05-02   90
    2014-04-30   AAPL  2014-04-28  2014-05-02   90
    2014-05-01   AAPL  2014-04-28  2014-05-02   90
    2014-05-02   AAPL  2014-04-28  2014-05-02   90
    2014-05-05   AAPL  2014-05-05  2014-05-09   80
    2014-05-06   AAPL  2014-05-05  2014-05-09   80
    2014-05-07   AAPL  2014-05-05  2014-05-09   80
    2014-05-08   AAPL  2014-05-05  2014-05-09   80
    2014-05-09   AAPL  2014-05-05  2014-05-09   80
    2014-05-12   AAPL  2014-05-12  2014-05-16   85
    2014-05-13   AAPL  2014-05-12  2014-05-16   85
    2014-05-14   AAPL  2014-05-12  2014-05-16   85
    2014-05-15   AAPL  2014-05-12  2014-05-16   85
    2014-05-16   AAPL  2014-05-12  2014-05-16   85

    [15 rows x 4 columns]

Hope it helps!

answered by antoniobotelho
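
On a side note, pandas 0.25 and later also offer DataFrame.explode, which allows a shorter sketch of the same decompression. The two-row sample below merely mirrors the question's layout; it is not the real dataset.

import pandas as pd

# Minimal sample mirroring the question's layout (values are assumed).
df = pd.DataFrame(
    {'ticker': ['AAPL', 'AAPL'],
     'start_date': pd.to_datetime(['2014-05-01', '2014-06-05']),
     'end_date': pd.to_datetime(['2014-05-01', '2014-06-10']),
     'val': [10.0, 20.0]})

# One list of calendar days per row, then explode each list into its own rows.
df['date'] = [pd.date_range(s, e, freq='D').tolist()
              for s, e in zip(df['start_date'], df['end_date'])]
decompressed = (df.explode('date')
                  .set_index(['ticker', 'date'])[['val']])

print(decompressed.loc['AAPL'])

pd.bdate_range can be swapped in for pd.date_range if, as in the answer above, only business days are wanted.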