For each row in the input table, I need to generate multiple rows by splitting the date range into monthly segments (please refer to the sample output below).
There is a simple iterative approach that converts row by row, but it is very slow on large DataFrames (a sketch of that baseline follows the sample tables).
Could anyone suggest a vectorized approach, e.g. using apply(), map(), etc., to achieve this?
The output should be a new table.
Input:
ID, START_DATE, END_DATE
1, 2010-12-08, 2011-03-01
2, 2010-12-10, 2011-01-12
3, 2010-12-16, 2011-03-07
Output:
ID, START_DATE, END_DATE, NUMBER_DAYS, ACTION_DATE
1, 2010-12-08, 2010-12-31, 23, 201012
1, 2010-12-08, 2011-01-31, 54, 201101
1, 2010-12-08, 2011-02-28, 82, 201102
1, 2010-12-08, 2011-03-01, 83, 201103
2, 2010-12-10, 2010-12-31, 21, 201012
2, 2010-12-10, 2011-01-12, 33, 201101
3, 2010-12-16, 2010-12-31, 15, 201012
3, 2010-12-16, 2011-01-31, 46, 201101
3, 2010-12-16, 2011-02-28, 74, 201102
3, 2010-12-16, 2011-03-07, 81, 201103
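For reference, here is a minimal sketch of the slow row-by-row baseline mentioned above (the function name split_monthly_loop is my own, and it assumes START_DATE and END_DATE are already datetime columns):

import pandas as pd

def split_monthly_loop(df):
    # Naive baseline: build the output one row at a time (slow on large frames).
    out = []
    for _, r in df.iterrows():
        # Month-end dates inside the range, plus END_DATE itself when it
        # is not already a month end.
        ends = list(pd.date_range(r.START_DATE, r.END_DATE, freq='M'))
        if not ends or ends[-1] != r.END_DATE:
            ends.append(r.END_DATE)
        for end in ends:
            out.append({'ID': r.ID,
                        'START_DATE': r.START_DATE,
                        'END_DATE': end,
                        'NUMBER_DAYS': (end - r.START_DATE).days,
                        'ACTION_DATE': end.strftime('%Y%m')})
    return pd.DataFrame(out)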
Be aware of the multiple meanings of vectorization. In Pandas, it just means a batch API. Numeric code in Pandas often benefits from the second meaning of vectorization as well: a vastly faster native-code loop. String operations in Pandas, although vectorized in the API sense, can often be slower, since they don't use native-code loops.
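For example (a small illustration of that distinction, not from the original post): both calls below go through the batch API, but only the numeric one runs as a native-code loop.

import pandas as pd

s_num = pd.Series(range(1_000_000))
s_str = s_num.astype(str)

fast = s_num * 2           # batch API + native-code loop
slow = s_str.str.zfill(8)  # batch API, but a Python-level loop per element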
I think you can use:
import pandas as pd
df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3},
'END_DATE': {0: pd.Timestamp('2011-03-01 00:00:00'),
1: pd.Timestamp('2011-01-12 00:00:00'),
2: pd.Timestamp('2011-03-07 00:00:00')},
'START_DATE': {0: pd.Timestamp('2010-12-08 00:00:00'),
1: pd.Timestamp('2010-12-10 00:00:00'),
2: pd.Timestamp('2010-12-16 00:00:00')}},
columns=['ID','START_DATE', 'END_DATE'])
print(df)
ID START_DATE END_DATE
0 1 2010-12-08 2011-03-01
1 2 2010-12-10 2011-01-12
2 3 2010-12-16 2011-03-07
#if multiple columns, you can filter them by subset
#df = df[['ID','START_DATE', 'END_DATE']]
#stack columns START_DATE and END_DATE
df1 = (df.set_index('ID')
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame()
         .rename(columns={0: 'Date'}))
#print df1
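For reference, printing df1 at this point shows both dates stacked into a single Date column, indexed by ID:

          Date
ID
1   2010-12-08
1   2011-03-01
2   2010-12-10
2   2011-01-12
3   2010-12-16
3   2011-03-07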
#resample and fill missing data
df1 = (df1.groupby(df1.index)
          .apply(lambda x: x.set_index('Date').resample('M').asfreq())
          .reset_index())
print(df1)
ID Date
0 1 2010-12-31
1 1 2011-01-31
2 1 2011-02-28
3 1 2011-03-31
4 2 2010-12-31
5 2 2011-01-31
6 3 2010-12-31
7 3 2011-01-31
8 3 2011-02-28
9 3 2011-03-31
There is a problem with the last day of the month, because resample always pads to the last day of each month. So first create period columns and then merge on them. combine_first fills the missing END_DATE values from column Date, and bfill fills the missing values of column START_DATE.
df['period'] = df.END_DATE.dt.to_period('M')
df1['period'] = df1.Date.dt.to_period('M')
df2 = pd.merge(df1, df, on=['ID','period'], how='left')
df2['END_DATE'] = df2.END_DATE.combine_first(df2.Date)
df2['START_DATE'] = df2.START_DATE.bfill()
df2 = df2.drop(['Date','period'], axis=1)
Last, add the new columns: NUMBER_DAYS from the date difference via dt.days, and ACTION_DATE via dt.strftime:
df2['NUMBER_DAYS'] = (df2.END_DATE - df2.START_DATE).dt.days
df2['ACTION_DATE'] = df2.END_DATE.dt.strftime('%Y%m')
print(df2)
ID START_DATE END_DATE NUMBER_DAYS ACTION_DATE
0 1 2010-12-08 2010-12-31 23 201012
1 1 2010-12-08 2011-01-31 54 201101
2 1 2010-12-08 2011-02-28 82 201102
3 1 2010-12-08 2011-03-01 83 201103
4 2 2010-12-10 2010-12-31 21 201012
5 2 2010-12-10 2011-01-12 33 201101
6 3 2010-12-16 2010-12-31 15 201012
7 3 2010-12-16 2011-01-31 46 201101
8 3 2010-12-16 2011-02-28 74 201102
9 3 2010-12-16 2011-03-07 81 201103
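As a quick sanity check (my addition, assuming the split_monthly_loop baseline sketched under the question was also defined in the same session), the vectorized result matches the row-by-row one:

pd.testing.assert_frame_equal(
    df2.reset_index(drop=True),
    split_monthly_loop(df).reset_index(drop=True),
    check_dtype=False,  # the loop version may infer slightly different dtypes
)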