Speeding up loop when normalizing Pandas data

I have a pandas dataframe:

|  col1  | heading |
|--------|---------|
|heading1|   true  |
|abc     |  false  |
|efg     |  false  |
|hij     |  false  |
|heading2|   true  |
|klm     |  false  |
|...     |  false  |

This data is actually "sequential" and I would like to transform it to this structure:

|  col1  |  Parent   |
|---------------------
|heading1|  heading1 |
|abc     |  heading1 | 
|efg     |  heading1 |
|hij     |  heading1 |
|heading2|  heading2 |
|klm     |  heading2 |
|...     |  headingN |

I have +10M rows so this method takes too long:

df['Parent'] = df['col1']

for index, row in df.iterrows():
    if row['heading']:
        current = row['col1']
    else:
        row.loc[index, 'Parent'] = current

Do you have any advice on a faster process?

Is Iterrows faster than apply?

The results show that apply massively outperforms iterrows . As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead.

Is pandas apply faster than for loop?

apply is not faster in itself but it has advantages when used in combination with DataFrames. This depends on the content of the apply expression. If it can be executed in Cython space, apply is much faster (which is the case here).

How do I iterate through pandas DataFrame fast?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

You can use a mask with ffill:

df.assign(heading=df.col1.mask(~df.col1.str.startswith('heading')).ffill())

       col1   heading
0  heading1  heading1
1       abc  heading1
2       efg  heading1
3       hij  heading1
4  heading2  heading2
5       klm  heading2

This works by replacing any value that does not start with heading with NaN, and then fills the last non-nan value forward:

df.col1.mask(~df.col1.str.startswith('heading'))

0    heading1
1         NaN
2         NaN
3         NaN
4    heading2
5         NaN
Name: col1, dtype: object

df.col1.mask(~df.col1.str.startswith('heading')).ffill()

0    heading1
1    heading1
2    heading1
3    heading1
4    heading2
5    heading2
Name: col1, dtype: object

Speeding up loop when normalizing Pandas data

Tags:

python

pandas

vectorization

numpy

user1965074

People also ask

1 Answers

user3483203

Recent Activity

Donate For Us

Speeding up loop when normalizing Pandas data

Tags:

python

pandas

vectorization

numpy

user1965074

People also ask

1 Answers

user3483203

Related questions

Recent Activity

Donate For Us