I have a dataframe with around around 9k rows and 57 cols, this is 'df'.
I need to have a new dataframe: 'df_final' - for each row of 'df' i have to replicate each row 'x' times and increase the day in each row one by one, also 'x' times. While i can do this for a couple of iterations, when i do it for the full length of 'df' ' len(df)' the loop it takes so long (>3 hours) that i actually had to cancel it. I have never seen the end of it. Here's the current code:
df.shape
output: (9454, 57)
df_int = df[0:0]
df_final = df_int[0:0]
range_df = len(df)
for x in range(0,2):
df_int = df.iloc[0+x:x+1]
if abs(df_int.iat[-1,3]) > 0:
df_int = pd.concat([df_int]*abs(df_int.iat[-1,3]), ignore_index=True)
for i in range(1, abs(df_int.iat[-1,3])):
df_int['Consumption Date'][i] = df_int['Consumption Date'][i-1] + datetime.timedelta(days = 1)
i += 1
df_final = df_final.append(df_int, ignore_index=True)
x += 1
The result of the loops for the first two rows of ' df' are below.
First two rows of df:
Desired result:
Is there another way to get to the desired output. It seems pandas do not deal very well with loops. In VBA excel the same loop takes around 3/4 minutes...i am trying to change a process which is currently in excel to python, however, if there's no way to make this work i guess i will stick to the old ways...
To create a data frame with a column having repeated values, we simply need to use rep function and we can repeat the values in a sequence of the values passed or repeating each value a particular number of times.
The copy() method returns a copy of the DataFrame. By default, the copy is a "deep copy" meaning that any changes made in the original DataFrame will NOT be reflected in the copy.
The pandas. DataFrame. duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean series which identifies whether a row is duplicate or unique.
Use repeat
and cumcount
In [2972]: dff = df.loc[df.index.repeat(3)]
In [2973]: dff
Out[2973]:
date name
0 2017-05-03 bob
0 2017-05-03 bob
0 2017-05-03 bob
1 2017-06-13 sally
1 2017-06-13 sally
1 2017-06-13 sally
In [2974]: dff.loc[:, 'date'] += pd.to_timedelta(dff.groupby(level=0).cumcount(), 'D')
In [2975]: dff
Out[2975]:
date name
0 2017-05-03 bob
0 2017-05-04 bob
0 2017-05-05 bob
1 2017-06-13 sally
1 2017-06-14 sally
1 2017-06-15 sally
Details
In [2976]: df
Out[2976]:
date name
0 2017-05-03 bob
1 2017-06-13 sally
In [2977]: dff.groupby(level=0).cumcount()
Out[2977]:
0 0
0 1
0 2
1 0
1 1
1 2
dtype: int64
Let's use this toy DataFrame:
df = pd.DataFrame({
'date': pd.to_datetime(['2017-05-03', '2017-06-13']),
'name': ['bob', 'sally'],
})
It looks like this:
date name
0 2017-05-03 bob
1 2017-06-13 sally
Then:
x = 3 # repeat count
ind = np.repeat(np.arange(len(df)), x) # 0,0,0,1,1,1
df_final = df.iloc[ind].copy()
That gives you the repeats:
date name
0 2017-05-03 bob
0 2017-05-03 bob
0 2017-05-03 bob
1 2017-06-13 sally
1 2017-06-13 sally
1 2017-06-13 sally
Now you just need to increment the dates:
inc = np.tile(np.arange(x), len(df)) # 0,1,2,0,1,2
df_final.date += pd.to_timedelta(inc, 'D')
And you get:
date name
0 2017-05-03 bob
0 2017-05-04 bob
0 2017-05-05 bob
1 2017-06-13 sally
1 2017-06-14 sally
1 2017-06-15 sally
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With