I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.
index date_from date_to
0 '2019-08-01' '2019-08-05'
1 '2019-08-04' '2019-08-07'
2 '2019-08-07' '2019-08-09'
And I need to calculate the number of dates for all ranges. The result should be like this:
date count
'2019-08-01' 1
'2019-08-02' 1
'2019-08-03' 1
'2019-08-04' 2
'2019-08-05' 2
'2019-08-06' 1
'2019-08-07' 2
'2019-08-08' 1
'2019-08-09' 1
I used a for-loop to solve the problem, but the calculations take a very long time because the original dataframe is quite large.
One option could be to Index.repeat the rows based on the number of days, generate the intermediate days with groupby.cumcount+pd.to_timedelta and value_counts:
# ensure datetime
df[['date_from', 'date_to']] = df[['date_from', 'date_to']].apply(pd.to_datetime)
# number of days between from and to
n = df['date_to'].sub(df['date_from']).dt.days
# repeat the days
s = df.loc[df.index.repeat(n+1), 'date_from']
# increment to create intermediates and count
out = (s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
.value_counts(sort=False)
)
Output:
2019-08-01 1
2019-08-02 1
2019-08-03 1
2019-08-04 2
2019-08-05 2
2019-08-06 1
2019-08-07 2
2019-08-08 1
2019-08-09 1
Name: count, dtype: int64
Intermediates:
# n
0 4
1 3
2 2
dtype: int64
# s
0 2019-08-01
0 2019-08-01
0 2019-08-01
0 2019-08-01
0 2019-08-01
1 2019-08-04
1 2019-08-04
1 2019-08-04
1 2019-08-04
2 2019-08-07
2 2019-08-07
2 2019-08-07
Name: date_from, dtype: datetime64[ns]
# s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
0 2019-08-01
0 2019-08-02
0 2019-08-03
0 2019-08-04
0 2019-08-05
1 2019-08-04
1 2019-08-05
1 2019-08-06
1 2019-08-07
2 2019-08-07
2 2019-08-08
2 2019-08-09
dtype: datetime64[ns]
import numpy as np
out = pd.Series.value_counts(np.fromiter((d for f, t in
zip(df['date_from'], df['date_to'])
for d in pd.date_range(f,t)), 'datetime64[ns]'),
sort=False)
Output:
2019-08-01 1
2019-08-02 1
2019-08-03 1
2019-08-04 2
2019-08-05 2
2019-08-06 1
2019-08-07 2
2019-08-08 1
2019-08-09 1
Name: count, dtype: int64
on 3 rows:
# pandas repeat + groupby.cumcount+timedelta + value_counts
1.6 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# numpy iterator + value_counts
604 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
on 30K rows:
# pandas repeat + groupby.cumcount+timedelta + value_counts
18.1 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy iterator + value_counts
3.52 s ± 70.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With