I am working on a problem that requires me to fill in rows for missing dates (i.e. the dates between two date columns of a pandas DataFrame). Please see the example below. My current approach uses pandas (mentioned below).
Input Data Example (which has around 25000 rows):
A | B | C | Date1 | Date2
a1 | b1 | c1 | 1Jan1990 | 15Aug1990 <- this row should be repeated for all dates between the two dates
.......................
a3 | b3 | c3 | 11May1986 | 11May1986 <- this row should NOT be repeated. Just 1 entry since both dates are same.
.......................
a5 | b5 | c5 | 1Dec1984 | 31Dec2017 <- this row should be repeated for all dates between the two dates
..........................
..........................
Output Expected:
A | B | C | Month | Year
a1 | b1 | c1 | 1 | 1990 <- Since date 1 column for this row was Jan 1990
a1 | b1 | c1 | 2 | 1990
.......................
.......................
a1 | b1 | c1 | 7 | 1990
a1 | b1 | c1 | 8 | 1990 <- Since date 2 column for this row was Aug 1990
..........................
a3 | b3 | c3 | 5 | 1986 <- only 1 row since two dates in input dataframe were same for this row.
...........................
a5 | b5 | c5 | 12 | 1984 <- since date 1 column for this row was Dec 1984
a5 | b5 | c5 | 1 | 1985
..........................
..........................
a5 | b5 | c5 | 11 | 2017
a5 | b5 | c5 | 12 | 2017 <- Since date 2 column for this row was Dec 2017
I know of a more traditional way to achieve this (my current approach):
Since the input data has around 25000 rows, I believe the output data will be very large, so I am looking for a more Pythonic way to achieve this (if possible, and faster than an iterative approach).
It looks to me like the best tool to use here is PeriodIndex (to generate the months and years between dates).
However, a PeriodIndex can only be built for one pair of dates at a time. So, if we are going to use PeriodIndex, every row has to be processed individually. That unfortunately means looping through the rows of the DataFrame:
import pandas as pd

df = pd.DataFrame([('a1','b1','c1','1Jan1990','15Aug1990'),
                   ('a3','b3','c3','11May1986','11May1986'),
                   ('a5','b5','c5','1Dec1984','31Dec2017')],
                  columns=['A','B','C','Date1','Date2'])

result = []
for tup in df.itertuples():
    # Monthly PeriodIndex from Date1 to Date2 (inclusive). pd.period_range is
    # used because recent pandas no longer accepts PeriodIndex(start=..., end=...).
    index = pd.period_range(start=tup.Date1, end=tup.Date2, freq='M')
    # Repeat the row's values once per month so the data matches the index length
    new_df = pd.DataFrame([(tup.A, tup.B, tup.C)] * len(index), index=index)
    new_df['Month'] = new_df.index.month
    new_df['Year'] = new_df.index.year
    result.append(new_df)

result = pd.concat(result, axis=0)
print(result)
yields
          0   1   2  Month  Year
1990-01  a1  b1  c1      1  1990   <--- Beginning of row 1
1990-02  a1  b1  c1      2  1990
1990-03  a1  b1  c1      3  1990
1990-04  a1  b1  c1      4  1990
1990-05  a1  b1  c1      5  1990
1990-06  a1  b1  c1      6  1990
1990-07  a1  b1  c1      7  1990
1990-08  a1  b1  c1      8  1990   <--- End of row 1
1986-05  a3  b3  c3      5  1986   <--- Beginning and End of row 2
1984-12  a5  b5  c5     12  1984   <--- Beginning of row 3
1985-01  a5  b5  c5      1  1985
1985-02  a5  b5  c5      2  1985
1985-03  a5  b5  c5      3  1985
1985-04  a5  b5  c5      4  1985
...      ..  ..  ..    ...   ...
2017-09  a5  b5  c5      9  2017
2017-10  a5  b5  c5     10  2017
2017-11  a5  b5  c5     11  2017
2017-12  a5  b5  c5     12  2017   <--- End of row 3
[406 rows x 5 columns]
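If the row-by-row loop turns out to be too slow for around 25000 input rows, one possible alternative is to skip PeriodIndex entirely and do the expansion with NumPy arithmetic. The following is only a sketch, not the approach used above: it assumes the dates always follow the day/abbreviated-month/year pattern shown in the question (parsed here with the format string '%d%b%Y'), and the names start, end, first, last, counts and expanded are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([('a1','b1','c1','1Jan1990','15Aug1990'),
                   ('a3','b3','c3','11May1986','11May1986'),
                   ('a5','b5','c5','1Dec1984','31Dec2017')],
                  columns=['A','B','C','Date1','Date2'])

# Assumption: every date looks like '1Jan1990' (day, English month abbreviation, year)
start = pd.to_datetime(df['Date1'], format='%d%b%Y')
end = pd.to_datetime(df['Date2'], format='%d%b%Y')

# Encode each endpoint as a single "absolute month" number: year * 12 + (month - 1)
first = (start.dt.year * 12 + start.dt.month - 1).to_numpy()
last = (end.dt.year * 12 + end.dt.month - 1).to_numpy()
counts = last - first + 1          # number of months each input row expands to

# Repeat each input row once per month it covers
expanded = df.loc[df.index.repeat(counts), ['A', 'B', 'C']].reset_index(drop=True)

# Position of each output row within its own group: 0, 1, ..., counts - 1
offsets = np.arange(counts.sum()) - np.repeat(counts.cumsum() - counts, counts)
months = np.repeat(first, counts) + offsets

# Decode the absolute month number back into Month and Year columns
expanded['Month'] = months % 12 + 1
expanded['Year'] = months // 12

print(expanded)
```

For the three sample rows this should produce the same 406 month/year combinations as the loop, with a plain integer index instead of a PeriodIndex. Whether it is actually faster on the real data would need to be measured.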
Note that you may not really need to define the Month and Year columns in the loop above
    new_df['Month'] = new_df.index.month
    new_df['Year'] = new_df.index.year
since you already have a PeriodIndex, which makes computing months and years very easy.
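As a small illustration of that last point, the month and year can be read straight off a PeriodIndex whenever they are needed. The index below is a stand-in built with pd.period_range for the first sample row; the variable name index is only illustrative.

```python
import pandas as pd

# Monthly periods covering January 1990 through August 1990
index = pd.period_range(start='1990-01', end='1990-08', freq='M')

# .month and .year are available directly on the PeriodIndex
print(list(index.month))   # [1, 2, 3, 4, 5, 6, 7, 8]
print(list(index.year))    # [1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990]
```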