I have a DF that has two dates of interest that looks kind of like:
LIST_DATE END_DATE
2000-04-18 2000-05-17 00:00:00
2000-05-18 2000-09-18 00:00:00
2000-04-18 2001-06-07 00:00:00
And I created a Period index table "montot" by month that currently only has the month and year index
<class 'pandas.tseries.period.PeriodIndex'>
freq: M
[1999-01, ..., 2013-07]
What I want to do is to for each month in the second table "montot" count the items in the 1st table which fall within the time periods (happens to be active listings by month) and add that field to the table... so for instance the 1st item in the 1st table would be counted 1 in month 4 and once in month 5 while the second item woul be counted once in month 5 through month 9 etc..with the monthly total being recorded in the new table/field. So I'll have a table
Month active
1/1999 5
2/1999 8
etc.. Have not got a clue how to approach it with Pandas/Python...
Here's one way to do it, first value_counts
the periods in each of the date columns (using the to_period
Timestamp method):
In [11]: p = pd.PeriodIndex(freq='m', start='2000-1', periods=18)
In [12]: starts = df['LIST_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()
In [13]: ends = df['END_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()
Reindex these by the PeriodIndex, fill in the NaNs (so you can subtract) and take the cumulative started from the cumulative ended, to give you the currently active:
In [14]: starts.reindex(p).fillna(0).cumsum() - ends.reindex(p).fillna(0).cumsum()
Out[14]:
2000-01 0
2000-02 0
2000-03 0
2000-04 2
2000-05 2
2000-06 2
2000-07 2
2000-08 2
2000-09 1
2000-10 1
2000-11 1
2000-12 1
2001-01 1
2001-02 1
2001-03 1
2001-04 1
2001-05 1
2001-06 0
Freq: M, dtype: float64
An alternative final step is to create a DataFrame (which initially tracks changes, hence starts is positive and ends negative):
In [21]: current = pd.DataFrame({'starts': starts, 'ends': -ends}, p)
In [22]: current
Out[22]:
ends starts
2000-01 NaN NaN
2000-02 NaN NaN
2000-03 NaN NaN
2000-04 NaN 2
2000-05 -1 1
2000-06 NaN NaN
2000-07 NaN NaN
2000-08 NaN NaN
2000-09 -1 NaN
2000-10 NaN NaN
2000-11 NaN NaN
2000-12 NaN NaN
2001-01 NaN NaN
2001-02 NaN NaN
2001-03 NaN NaN
2001-04 NaN NaN
2001-05 NaN NaN
2001-06 -1 NaN
In [23]: current.fillna(0)
Out[23]:
ends starts
2000-01 0 0
2000-02 0 0
2000-03 0 0
2000-04 0 2
2000-05 -1 1
2000-06 0 0
2000-07 0 0
2000-08 0 0
2000-09 -1 0
2000-10 0 0
2000-11 0 0
2000-12 0 0
2001-01 0 0
2001-02 0 0
2001-03 0 0
2001-04 0 0
2001-05 0 0
2001-06 -1 0
The cumsum track the running totals of starts and ends up to that point:
In [24]: current.fillna(0).cumsum()
Out[24]:
ends starts
2000-01 0 0
2000-02 0 0
2000-03 0 0
2000-04 0 2
2000-05 -1 3
2000-06 -1 3
2000-07 -1 3
2000-08 -1 3
2000-09 -2 3
2000-10 -2 3
2000-11 -2 3
2000-12 -2 3
2001-01 -2 3
2001-02 -2 3
2001-03 -2 3
2001-04 -2 3
2001-05 -2 3
2001-06 -3 3
And summing these columns together, gives those currently active, and is same result as above:
In [25]: current.fillna(0).cumsum().sum(1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With