I have a pandas dataframe:
| col1 | heading |
|--------|---------|
|heading1| true |
|abc | false |
|efg | false |
|hij | false |
|heading2| true |
|klm | false |
|... | false |
This data is actually "sequential" and I would like to transform it to this structure:
| col1 | Parent |
|--------|----------|
|heading1| heading1 |
|abc | heading1 |
|efg | heading1 |
|hij | heading1 |
|heading2| heading2 |
|klm | heading2 |
|... | headingN |
I have more than 10M rows, so this method takes too long:
df['Parent'] = df['col1']
current = None
for index, row in df.iterrows():
    if row['heading']:
        current = row['col1']
    else:
        # write to the DataFrame itself; row is only a copy of the data
        df.loc[index, 'Parent'] = current
Do you have any advice on a faster process?
The results show that apply massively outperforms iterrows, because apply is optimized to loop through DataFrame rows much more quickly than iterrows does. itertuples is slower than apply but quicker than iterrows, so if an explicit loop is required, try itertuples instead.
apply is not faster in itself, but it has advantages when used with DataFrames, depending on the content of the apply expression. If the expression can be executed in Cython space, apply is much faster (which is the case here).
Vectorization is always the first and best choice. You can also convert the DataFrame to a NumPy array or to a dictionary to speed up iteration. Iterating through the key-value pairs of a dictionary comes out as the fastest looping approach, with around a 280x speedup for 20 million records.
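If an explicit loop really is needed, the itertuples suggestion above could look roughly like the sketch below. It rebuilds a small copy of the question's frame and assumes the heading column holds real booleans; the helper name parents_with_itertuples is made up for illustration.

```python
import pandas as pd

# Small stand-in for the question's frame (the real one has 10M+ rows).
df = pd.DataFrame({
    "col1": ["heading1", "abc", "efg", "hij", "heading2", "klm"],
    "heading": [True, False, False, False, True, False],
})

def parents_with_itertuples(frame):
    """Return a list holding the most recent heading seen for each row."""
    parents = []
    current = None  # stays None for rows that come before the first heading
    for row in frame.itertuples(index=False):
        if row.heading:
            current = row.col1
        parents.append(current)
    return parents

# Assign the whole column once instead of writing cell by cell.
df["Parent"] = parents_with_itertuples(df)
```

Collecting the values in a plain list and assigning once avoids a df.loc write on every row, which is usually a large part of the cost of the original loop. This is still a Python-level loop, so a vectorized solution should remain the first thing to try.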
You can use mask with ffill:
df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())
col1 heading
0 heading1 heading1
1 abc heading1
2 efg heading1
3 hij heading1
4 heading2 heading2
5 klm heading2
This works by replacing any value that does not start with heading with NaN, and then filling the last non-NaN value forward:
df.col1.mask(~df.col1.str.startswith('heading'))
0 heading1
1 NaN
2 NaN
3 NaN
4 heading2
5 NaN
Name: col1, dtype: object
df.col1.mask(~df.col1.str.startswith('heading')).ffill()
0 heading1
1 heading1
2 heading1
3 heading1
4 heading2
5 heading2
Name: col1, dtype: object
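Since the question's frame already has a heading flag, the same idea might be written without the string test. The sketch below is a variation, not part of the answer above. It assumes heading is a real boolean column (not the strings "true"/"false") and writes the result to a Parent column, as the question asks.

```python
import pandas as pd

# Small stand-in for the question's frame.
df = pd.DataFrame({
    "col1": ["heading1", "abc", "efg", "hij", "heading2", "klm"],
    "heading": [True, False, False, False, True, False],
})

# Keep col1 only where heading is True (NaN elsewhere), then forward-fill
# so each row carries the most recent heading above it.
df["Parent"] = df["col1"].where(df["heading"]).ffill()
```

where keeps values where the condition is True, so it is the mirror image of mask with the negated condition used above. Rows that come before the first heading, if any, stay NaN.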