Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas - Splitting date ranges on specific day boundary

I've got a DataFrame of date ranges (the actual DataFrame has more data attached to it but has the same start and end columns). The data ultimately needs to be analyzed week-by-week on a Sunday-Saturday basis. Thus, I'd like to go through the DataFrame, and split any date ranges (start to finish) that cross from a Saturday to Sunday. For example, given the DataFrame:

import pandas as pd

date_ranges = [
    {'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
    {'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
    {'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
    {'start': '2020-01-18 22:30:00', 'end': '2020-01-19 02:00:00'}  # spans sat-sun, must split
]
data_df = pd.DataFrame(date_ranges)

I want my result to look like:

result_ranges = [
    {'start': '2020-01-16 22:30:00', 'end': '2020-01-17 01:00:00'}, # spans thurs-fri, ok as is
    {'start': '2020-01-17 04:30:00', 'end': '2020-01-17 12:30:00'}, # no span, ok as is
    {'start': '2020-01-18 10:15:00', 'end': '2020-01-18 14:00:00'}, # no span, ok as is
    {'start': '2020-01-18 22:30:00', 'end': '2020-01-19 00:00:00'}, # split out saturday portion
    {'start': '2020-01-19 00:00:00', 'end': '2020-01-19 02:00:00'}  # and the sunday portion
]

result_df = pd.DataFrame(result_ranges)

Any thoughts on how to effectively do this in pandas would be greatly appreciated. Currently I am doing the bad thing, and iterating over rows, and it is quite slow when the data set gets large.

like image 759
MarkD Avatar asked Jan 17 '20 17:01

MarkD


People also ask

How do you split a date range in Python?

Method #1 : Using loop In this, we compute each segment duration using division of whole duration by N. Post that, each date is built using segment duration multiplication in loop.

How do I select a specific date range in Python?

There are two possible solutions: Use a boolean mask, then use df. loc[mask] Set the date column as a DatetimeIndex, then use df[start_date : end_date]


2 Answers

Manipulations like this are always difficult and at some level I think a loop is necessary. In this case, instead of looping over the rows, we can loop over the edges. This should lead to a rather big gain in performance when the number of weeks your data span is much smaller than the number of rows you have.

We define edges and modify the DataFrame endpoints where necessary. In the end the desired DataFrame is whatever is left of the DataFrame we modified, plus all the separate timespans we stored in l. The original Index is preserved, so you can see exactly what rows were split. If a single timespan straddles N edges it gets split into N+1 separate rows.

Setup

import pandas as pd

df[['start', 'end']]= df[['start', 'end']].apply(pd.to_datetime)

edges = pd.date_range(df.start.min().normalize() - pd.Timedelta(days=7),
                      df.end.max().normalize() + pd.Timedelta(days=7), freq='W-Sun')

Code

l = []
for edge in edges:
    m = df.start.lt(edge) & df.end.gt(edge)  # Rows to modify
    l.append(df.loc[m].assign(end=edge))     # Clip end of modified rows
    df.loc[m, 'start'] = edge                # Fix start for next edge

result = pd.concat(l+[df]).sort_values('start')

Output

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
3 2020-01-19 00:00:00 2020-01-19 02:00:00
like image 102
ALollz Avatar answered Sep 28 '22 17:09

ALollz


My solution is even more general that you defined, namely it creates a sequence of "week rows" from each source row, even if both dates contain between them e.g. two Sat/Sun breaks.

To check that it works, I added one such row to your DataFrame, so that it contains:

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 02:00:00
4 2020-01-25 20:30:00 2020-02-02 03:00:00

Note that the last row includes two Sat/Sun break, from 25.01 to 26.01 and from 1.02 to 2.02.

Start from conversion of both columns to datetime:

data_df.start = pd.to_datetime(data_df.start)
data_df.end = pd.to_datetime(data_df.end)

To process your data, define the following function, to be applied to each row:

def weekRows(row):
    row.index = pd.DatetimeIndex(row)
    gr = row.resample('W-SUN', closed='left')
    ngr = gr.ngroups  # Number of groups
    i = 1
    data = []
    for key, grp in gr:
        dt1 = key - pd.Timedelta('7D')
        dt2 = key
        if i == 1:
            dt1 = row.iloc[0]
        if i == ngr:
            dt2 = row.iloc[1]
        data.append([dt1, dt2])
        i += 1
    return pd.DataFrame(data, columns=['start', 'end'])

Let's present "individually", how it operates on 2 last rows:

When you run:

row = data_df.loc[3]
weekRows(row)

(for the last but one row), you will get:

                start                 end
0 2020-01-18 22:30:00 2020-01-19 00:00:00
1 2020-01-19 00:00:00 2020-01-19 02:00:00

And when you run:

row = data_df.loc[4]
weekRows(row)

(for the last), you will get:

                start                 end
0 2020-01-25 20:30:00 2020-01-26 00:00:00
1 2020-01-26 00:00:00 2020-02-02 00:00:00
2 2020-02-02 00:00:00 2020-02-02 03:00:00

And to get your desired result, run:

result = pd.concat(data_df.apply(weekRows, axis=1).values, ignore_index=True)

The result is:

                start                 end
0 2020-01-16 22:30:00 2020-01-17 01:00:00
1 2020-01-17 04:30:00 2020-01-17 12:30:00
2 2020-01-18 10:15:00 2020-01-18 14:00:00
3 2020-01-18 22:30:00 2020-01-19 00:00:00
4 2020-01-19 00:00:00 2020-01-19 02:00:00
5 2020-01-25 20:30:00 2020-01-26 00:00:00
6 2020-01-26 00:00:00 2020-02-02 00:00:00
7 2020-02-02 00:00:00 2020-02-02 03:00:00

First 3 rows result from your first 3 source rows. Two next rows (index 3 and 4) result from source row with index 3. And the last 3 row (index 5 thru 7) result from the last source row.

like image 26
Valdi_Bo Avatar answered Sep 28 '22 15:09

Valdi_Bo