Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting dates in a range of dates

I have a pandas dataframe that contains two date columns, a start date and an end date that defines a range. I'd like to be able to collect a total count for all dates across all rows in the dataframe, as defined by these columns.

index     date_from       date_to
    0  '2019-08-01'  '2019-08-05'
    1  '2019-08-04'  '2019-08-07'
    2  '2019-08-07'  '2019-08-09'

And I need to calculate the number of dates for all ranges. The result should be like this:

date           count
'2019-08-01'     1
'2019-08-02'     1
'2019-08-03'     1
'2019-08-04'     2
'2019-08-05'     2
'2019-08-06'     1
'2019-08-07'     2
'2019-08-08'     1
'2019-08-09'     1

I used a for-loop to solve the problem, but the calculations take a very long time because the original dataframe is quite large.

like image 975
Николай Черкашин Avatar asked Nov 21 '25 21:11

Николай Черкашин


1 Answers

One option could be to Index.repeat the rows based on the number of days, generate the intermediate days with groupby.cumcount+pd.to_timedelta and value_counts:

# ensure datetime
df[['date_from', 'date_to']] = df[['date_from', 'date_to']].apply(pd.to_datetime)

# number of days between from and to
n = df['date_to'].sub(df['date_from']).dt.days

# repeat the days
s = df.loc[df.index.repeat(n+1), 'date_from']
# increment to create intermediates and count
out = (s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
        .value_counts(sort=False)
      )

Output:

2019-08-01    1
2019-08-02    1
2019-08-03    1
2019-08-04    2
2019-08-05    2
2019-08-06    1
2019-08-07    2
2019-08-08    1
2019-08-09    1
Name: count, dtype: int64

Intermediates:

# n
0    4
1    3
2    2
dtype: int64

# s
0   2019-08-01
0   2019-08-01
0   2019-08-01
0   2019-08-01
0   2019-08-01
1   2019-08-04
1   2019-08-04
1   2019-08-04
1   2019-08-04
2   2019-08-07
2   2019-08-07
2   2019-08-07
Name: date_from, dtype: datetime64[ns]

# s.add(pd.to_timedelta(s.groupby(level=0).cumcount(), unit='day'))
0   2019-08-01
0   2019-08-02
0   2019-08-03
0   2019-08-04
0   2019-08-05
1   2019-08-04
1   2019-08-05
1   2019-08-06
1   2019-08-07
2   2019-08-07
2   2019-08-08
2   2019-08-09
dtype: datetime64[ns]

another option, using a generator

import numpy as np

out = pd.Series.value_counts(np.fromiter((d for f, t in
                                          zip(df['date_from'], df['date_to'])
                              for d in pd.date_range(f,t)), 'datetime64[ns]'),
                             sort=False)

Output:

2019-08-01    1
2019-08-02    1
2019-08-03    1
2019-08-04    2
2019-08-05    2
2019-08-06    1
2019-08-07    2
2019-08-08    1
2019-08-09    1
Name: count, dtype: int64

Timings

on 3 rows:

# pandas repeat + groupby.cumcount+timedelta + value_counts
1.6 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# numpy iterator + value_counts
604 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

on 30K rows:

# pandas repeat + groupby.cumcount+timedelta + value_counts
18.1 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# numpy iterator + value_counts
3.52 s ± 70.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 182
mozway Avatar answered Nov 24 '25 21:11

mozway



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!