
Vectorized implementation to create multiple rows from a single row in pandas dataframe

For each row in the input table, I need to generate multiple rows by splitting the date range at month boundaries (see the sample output below).

There is a simple iterative approach to convert row by row, but it is very slow on large dataframes.

Could anyone suggest a vectorized approach, for example using apply() or map(), to achieve this?

The output table is a new table.

Input:

ID, START_DATE, END_DATE
1, 2010-12-08, 2011-03-01
2, 2010-12-10, 2011-01-12
3, 2010-12-16, 2011-03-07

Output:

ID, START_DATE, END_DATE, NUMBER_DAYS, ACTION_DATE
1, 2010-12-08, 2010-12-31, 23, 201012
1, 2010-12-08, 2011-01-31, 54, 201101
1, 2010-12-08, 2011-02-28, 82, 201102
1, 2010-12-08, 2011-03-01, 83, 201103
2, 2010-12-10, 2010-12-31, 21, 201012
2, 2010-12-10, 2011-01-12, 33, 201101
3, 2010-12-16, 2010-12-31, 15, 201012
3, 2010-12-16, 2011-01-31, 46, 201101
3, 2010-12-16, 2011-02-28, 74, 201102
3, 2010-12-16, 2011-03-07, 81, 201103
asked May 05 '16 08:05 by Yiliang




1 Answer

I think you can use:

import pandas as pd

df = pd.DataFrame({'ID': {0: 1, 1: 2, 2: 3}, 
'END_DATE': {0: pd.Timestamp('2011-03-01 00:00:00'),
             1: pd.Timestamp('2011-01-12 00:00:00'), 
             2: pd.Timestamp('2011-03-07 00:00:00')}, 
'START_DATE': {0: pd.Timestamp('2010-12-08 00:00:00'), 
               1: pd.Timestamp('2010-12-10 00:00:00'), 
               2: pd.Timestamp('2010-12-16 00:00:00')}}, 
columns=['ID','START_DATE', 'END_DATE'])

print(df)
   ID START_DATE   END_DATE
0   1 2010-12-08 2011-03-01
1   2 2010-12-10 2011-01-12
2   3 2010-12-16 2011-03-07

#if multiple columns, you can filter them by subset
#df = df[['ID','START_DATE', 'END_DATE']]

#stack columns START_DATE and END_DATE into a single Date column
df1 = (df.set_index('ID')
         .stack()
         .reset_index(level=1, drop=True)
         .to_frame()
         .rename(columns={0:'Date'}))
#print(df1)

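#resample each ID to month end and fill in the missing months
#(recent pandas versions prefer the alias 'ME' over 'M' for month end)
df1 = (df1.groupby(df1.index)
          .apply(lambda x: x.set_index('Date').resample('M').asfreq())
          .reset_index())
print(df1)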

   ID       Date
0   1 2010-12-31
1   1 2011-01-31
2   1 2011-02-28
3   1 2011-03-31
4   2 2010-12-31
5   2 2011-01-31
6   3 2010-12-31
7   3 2011-01-31
8   3 2011-02-28
9   3 2011-03-31
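
To see why every group ends on a month-end date, here is a minimal sketch of the resample step on a single pair of dates. It is an illustration, not part of the original answer: it passes pd.offsets.MonthEnd() instead of the 'M' alias (an assumption made here so the snippet does not depend on which alias spelling a given pandas version accepts), and the values 1 and 2 are placeholders.

```python
import pandas as pd

# Two observations standing in for START_DATE and END_DATE of ID 1.
s = pd.Series([1, 2], index=pd.to_datetime(["2010-12-08", "2011-03-01"]))

# Month-end resampling labels each monthly bin by its last calendar day,
# so the bin containing 2011-03-01 is labelled 2011-03-31.
month_ends = s.resample(pd.offsets.MonthEnd()).asfreq().index

print(list(month_ends.strftime("%Y-%m-%d")))
```

The last label is the end of March rather than the real END_DATE of 2011-03-01, which is the value that has to be corrected afterwards.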

The last date of each group is wrong, because resample returns the last day of every month, including the final one. To fix this, first create period columns in both frames and merge on them. Then use combine_first to fill the missing END_DATE values from the Date column, and bfill to fill the missing START_DATE values.

df['period'] = df.END_DATE.dt.to_period('M')
df1['period'] = df1.Date.dt.to_period('M')

df2 = pd.merge(df1, df, on=['ID','period'], how='left')

df2['END_DATE'] = df2.END_DATE.combine_first(df2.Date)
df2['START_DATE'] = df2.START_DATE.bfill()
df2 = df2.drop(['Date','period'], axis=1)
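
The effect of combine_first and bfill can be checked on a toy frame. The sketch below is an illustration under assumptions, not the original answer's code: the three rows mimic what the left merge might produce for ID 1, and bfill is applied per ID through groupby, a variation on the whole-column bfill used above that would only matter if some ID had no matched row.

```python
import pandas as pd

# Toy version of df2 for ID 1 after the left merge: only the last row matched.
df2 = pd.DataFrame({
    "ID": [1, 1, 1],
    "Date": pd.to_datetime(["2011-01-31", "2011-02-28", "2011-03-31"]),
    "START_DATE": pd.to_datetime([None, None, "2010-12-08"]),
    "END_DATE": pd.to_datetime([None, None, "2011-03-01"]),
})

# Keep END_DATE where it exists, otherwise fall back to the month-end Date.
df2["END_DATE"] = df2["END_DATE"].combine_first(df2["Date"])

# Back-fill START_DATE within each ID so values never leak across groups.
df2["START_DATE"] = df2.groupby("ID")["START_DATE"].bfill()

print(df2)
```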

Finally, add the new columns: the day difference via dt.days and the year-month label via dt.strftime:

df2['NUMBER_DAYS'] = (df2.END_DATE - df2.START_DATE).dt.days
df2['ACTION_DATE'] = df2.END_DATE.dt.strftime('%Y%m')

print(df2)
   ID START_DATE   END_DATE  NUMBER_DAYS ACTION_DATE
0   1 2010-12-08 2010-12-31           23      201012
1   1 2010-12-08 2011-01-31           54      201101
2   1 2010-12-08 2011-02-28           82      201102
3   1 2010-12-08 2011-03-01           83      201103
4   2 2010-12-10 2010-12-31           21      201012
5   2 2010-12-10 2011-01-12           33      201101
6   3 2010-12-16 2010-12-31           15      201012
7   3 2010-12-16 2011-01-31           46      201101
8   3 2010-12-16 2011-02-28           74      201102
9   3 2010-12-16 2011-03-07           81      201103
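
For comparison, the same table can probably be built without groupby or merge by repeating rows with NumPy. This is a different technique from the resample approach above, offered only as a sketch: expand_by_month is a hypothetical helper name, and it assumes both date columns are already datetime64 and that END_DATE is never earlier than START_DATE.

```python
import numpy as np
import pandas as pd


def expand_by_month(df):
    """Return one row per calendar month touched by [START_DATE, END_DATE]."""
    first = df["START_DATE"].dt.to_period("M")
    last = df["END_DATE"].dt.to_period("M")
    # Number of calendar months each input row spans (inclusive).
    n = ((last.dt.year - first.dt.year) * 12
         + (last.dt.month - first.dt.month) + 1).to_numpy()

    # Repeat every input row n times.
    out = df.iloc[np.repeat(np.arange(len(df)), n)].reset_index(drop=True)
    # Month offset within each original row: 0, 1, ..., n-1.
    k = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)

    # Last calendar day of the k-th month counted from START_DATE's month.
    month = out["START_DATE"].dt.to_period("M") + k
    month_end = month.dt.to_timestamp(how="end").dt.normalize()

    # Use the month end, except in the final month where END_DATE comes first.
    out["END_DATE"] = month_end.where(month_end < out["END_DATE"], out["END_DATE"])
    out["NUMBER_DAYS"] = (out["END_DATE"] - out["START_DATE"]).dt.days
    out["ACTION_DATE"] = out["END_DATE"].dt.strftime("%Y%m")
    return out


df = pd.DataFrame({
    "ID": [1, 2, 3],
    "START_DATE": pd.to_datetime(["2010-12-08", "2010-12-10", "2010-12-16"]),
    "END_DATE": pd.to_datetime(["2011-03-01", "2011-01-12", "2011-03-07"]),
})
result = expand_by_month(df)
print(result)
```

Because the row count is computed arithmetically from the year and month numbers, no Python-level loop over rows is needed, which is the property the question asks for.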
answered Oct 12 '22 12:10 by jezrael