Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a Pandas dataframe with counts of items spanning a date range

Tags:

python

pandas

I have a DF that has two dates of interest that looks kind of like:

LIST_DATE     END_DATE
2000-04-18    2000-05-17 00:00:00
2000-05-18    2000-09-18 00:00:00
2000-04-18    2001-06-07 00:00:00

And I created a Period index table "montot" by month that currently only has the month and year index

<class 'pandas.tseries.period.PeriodIndex'>
freq: M
[1999-01, ..., 2013-07]

What I want to do is to for each month in the second table "montot" count the items in the 1st table which fall within the time periods (happens to be active listings by month) and add that field to the table... so for instance the 1st item in the 1st table would be counted 1 in month 4 and once in month 5 while the second item woul be counted once in month 5 through month 9 etc..with the monthly total being recorded in the new table/field. So I'll have a table

Month    active
1/1999     5
2/1999     8

etc.. Have not got a clue how to approach it with Pandas/Python...

like image 994
dartdog Avatar asked Mar 22 '23 10:03

dartdog


1 Answers

Here's one way to do it, first value_counts the periods in each of the date columns (using the to_period Timestamp method):

In [11]: p = pd.PeriodIndex(freq='m', start='2000-1', periods=18)

In [12]: starts = df['LIST_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()

In [13]: ends = df['END_DATE'].apply(lambda t: t.to_period(freq='m')).value_counts()

Reindex these by the PeriodIndex, fill in the NaNs (so you can subtract) and take the cumulative started from the cumulative ended, to give you the currently active:

In [14]: starts.reindex(p).fillna(0).cumsum() - ends.reindex(p).fillna(0).cumsum()
Out[14]: 
2000-01    0
2000-02    0
2000-03    0
2000-04    2
2000-05    2
2000-06    2
2000-07    2
2000-08    2
2000-09    1
2000-10    1
2000-11    1
2000-12    1
2001-01    1
2001-02    1
2001-03    1
2001-04    1
2001-05    1
2001-06    0
Freq: M, dtype: float64

An alternative final step is to create a DataFrame (which initially tracks changes, hence starts is positive and ends negative):

In [21]: current = pd.DataFrame({'starts': starts, 'ends': -ends}, p)

In [22]: current
Out[22]:
         ends  starts
2000-01   NaN     NaN
2000-02   NaN     NaN
2000-03   NaN     NaN
2000-04   NaN       2
2000-05    -1       1
2000-06   NaN     NaN
2000-07   NaN     NaN
2000-08   NaN     NaN
2000-09    -1     NaN
2000-10   NaN     NaN
2000-11   NaN     NaN
2000-12   NaN     NaN
2001-01   NaN     NaN
2001-02   NaN     NaN
2001-03   NaN     NaN
2001-04   NaN     NaN
2001-05   NaN     NaN
2001-06    -1     NaN

In [23]: current.fillna(0)
Out[23]:
         ends  starts
2000-01     0       0
2000-02     0       0
2000-03     0       0
2000-04     0       2
2000-05    -1       1
2000-06     0       0
2000-07     0       0
2000-08     0       0
2000-09    -1       0
2000-10     0       0
2000-11     0       0
2000-12     0       0
2001-01     0       0
2001-02     0       0
2001-03     0       0
2001-04     0       0
2001-05     0       0
2001-06    -1       0 

The cumsum track the running totals of starts and ends up to that point:

In [24]: current.fillna(0).cumsum()
Out[24]:
         ends  starts
2000-01     0       0
2000-02     0       0
2000-03     0       0
2000-04     0       2
2000-05    -1       3
2000-06    -1       3
2000-07    -1       3
2000-08    -1       3
2000-09    -2       3
2000-10    -2       3
2000-11    -2       3
2000-12    -2       3
2001-01    -2       3
2001-02    -2       3
2001-03    -2       3
2001-04    -2       3
2001-05    -2       3
2001-06    -3       3 

And summing these columns together, gives those currently active, and is same result as above:

In [25]: current.fillna(0).cumsum().sum(1)
like image 159
Andy Hayden Avatar answered Apr 06 '23 17:04

Andy Hayden