Gaps in dates while rolling up quarters into a single row

I am trying to roll up rows of a data set that share the same measures into a single consolidated row. Two conditions must be met for rows to be rolled up:

  • The measures (ranging from 1-5) should remain the same across the rows for them to be rolled up to a single row.
  • The dates should be continuous (no gaps in dates).

If these conditions are not met, the code should generate a separate row.
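
To make the two conditions concrete, here is a minimal sketch of the test that decides whether a row may be merged into the row before it. The function name `can_merge` and the dict-based row representation are assumptions made for illustration only, and "continuous" is taken to mean that a row begins exactly one day after the previous row ends.

```python
from datetime import date, timedelta

MEASURES = ['measure1', 'measure2', 'measure3', 'measure4', 'measure5']


def can_merge(prev_row, row):
    """Return True if `row` may be rolled up into `prev_row` (hypothetical helper)."""
    same_id = row['id'] == prev_row['id']
    # Condition 1: every measure is unchanged
    same_measures = all(row[m] == prev_row[m] for m in MEASURES)
    # Condition 2: the row starts the day after the previous row ends
    no_gap = row['begin_date'] == prev_row['end_date'] + timedelta(days=1)
    return same_id and same_measures and no_gap


q1 = {'id': 'ABC123XYZ789', **{m: 1 for m in MEASURES},
      'begin_date': date(2019, 1, 1), 'end_date': date(2019, 3, 31)}
q2_gap = {**q1, 'begin_date': date(2019, 4, 23), 'end_date': date(2019, 6, 30)}
q2_ok = {**q1, 'begin_date': date(2019, 4, 1), 'end_date': date(2019, 6, 30)}

print(can_merge(q1, q2_gap))  # False: 4/23 is not the day after 3/31
print(can_merge(q1, q2_ok))   # True
```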

This is the sample data that I am using:

id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,6/30/2019
ABC123XYZ789,1,1,1,1,1,7/1/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,3/31/2019
FGH589J6U88SW,1,1,1,1,1,4/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,9/30/2019
FGH589J6U88SW,1,1,1,2,1,10/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,3/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,4/1/2019,6/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,7/1/2019,9/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,10/1/2019,12/31/2019

The expected result should be:

id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,12/31/2019

I have implemented the code below which seems to address condition # 1, but I am looking for ideas on how to incorporate condition # 2 into the solution.

import pandas as pd
import time

startTime=time.time()
data=pd.read_csv('C:\\Users\\usertemp\\Data\\Rollup2.csv')

data['end_date']= pd.to_datetime(data['end_date']) 
data['begin_date']= pd.to_datetime(data['begin_date']) 

data = data.groupby(['id','measure1','measure2', 'measure3', 'measure4', 'measure5']) \
    [['begin_date', 'end_date']].agg({'begin_date': ['min'], 'end_date': ['max']}).reset_index()
print(data)
print("It took %s seconds for the collapse process" % (time.time() - startTime))

Any help is appreciated.

user12593294 asked Jan 20 '26 09:01


1 Answer

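You can flag every row whose begin_date is not the day after the previous row's end_date, turn those flags into a running counter, and then group by id, the combined measures, and that counter: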

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# `df` is the DataFrame read from the CSV in the question (named `data` there)
# Convert begin_date and end_date to datetime
df['begin_date'] = pd.to_datetime(df['begin_date'], format='%m/%d/%Y')
df['end_date']= pd.to_datetime(df['end_date'], format='%m/%d/%Y')

# We create a new column which contains the end_date+1 from the previous row
df['end_date_prev'] = df['end_date'].iloc[:-1] + timedelta(days=1)
df['end_date_prev'] = np.roll(df['end_date_prev'], 1)

# Create a running counter that increments whenever begin_date and end_date_prev don't match
df['cont'] = (~(df['begin_date'] == df['end_date_prev'])).astype(int).cumsum()

# Since we need all measures to match we create a string column containing all measurements
df['comb_measure'] = df['measure1'].astype(str).str.cat(df[['measure{}'.format(i) for i in range(2,6)]].astype(str))

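# Get the final df: one row per (id, measure combination, continuous run),
# keeping the groups in their original order
new_df = df.groupby(['id', 'comb_measure', 'cont'], sort=False).agg(
    {'measure1':'first', 'measure2':'first', 'measure3':'first', 'measure4':'first', 'measure5':'first', 
     'begin_date':'first', 'end_date':'last'})
# Turn the group keys back into columns and drop the helper keys
new_df = new_df.reset_index().drop(columns=['comb_measure', 'cont'])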
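
One caveat with the code above: the counter only tracks date gaps, so if a measure changes and later changes back with no gap in dates, the non-adjacent rows share the same counter and measure string and would be merged. The following is a sketch of a variant that starts a new run whenever the id changes, any measure changes, or there is a date gap. It assumes the rows are sorted by `id` and `begin_date` and that the date columns are already datetimes; the names `roll_up`, `MEASURES`, and `run` are illustrative and not part of the original code.

```python
import pandas as pd

MEASURES = ['measure1', 'measure2', 'measure3', 'measure4', 'measure5']


def roll_up(df):
    """Collapse consecutive rows with the same id and measures and no date gap.

    Assumes df is sorted by id and begin_date, has a unique index, and
    that begin_date/end_date are datetime64 columns.
    """
    keys = ['id'] + MEASURES
    prev = df.shift(1)  # previous row's values (NaN/NaT for the first row)

    # A row starts a new run if the id or any measure differs from the previous row ...
    key_changed = (df[keys] != prev[keys]).any(axis=1)
    # ... or if it does not begin exactly one day after the previous row ended
    gap = df['begin_date'] != prev['end_date'] + pd.Timedelta(days=1)

    # Running counter that increments at the start of every run
    run = (key_changed | gap).cumsum().rename('run')

    agg_spec = {k: 'first' for k in keys}
    agg_spec.update({'begin_date': 'first', 'end_date': 'last'})
    return df.groupby(run, sort=False).agg(agg_spec).reset_index(drop=True)
```

Called as `roll_up(df)` on the converted sample data, this should return one row per continuous run, with `id` as a regular column and no helper columns left over.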

thushv89 answered Jan 22 '26 21:01