How would you optimize this short but very slow Python loop?

Question

I'm switching from R to Python. Unfortunately, I found that while some structures run almost instantly in R, they take some seconds (and even minutes) in Python. Upon reading I found for loops are strongly discouraged in pandas, and other alternatives such as vectorization and apply are recommended.

In this sample code: From a column of values that are sorted from min to max, keep all the values that come first after a gap of length '200'.

import numpy as np
import pandas as pd

#Let's create the sample data. It consists of a column with random sorted values, and an extra True/False column, where we will flag the values we want
series = np.random.uniform(1,1000000,100000)
test = [True]*100000
data = pd.DataFrame({'series' : series, 'test':test })
data.sort_values(by=['series'], inplace=True)

#Loop to get rid of the next values that fall within the '200' threshold after the first next valid value
for i in data['series']:
    if data.loc[data['series'] == i,'test'].item() == True:
        data.loc[(data['series'] > i) & (data['series'] <= i+200  ) ,'test' ] = False
#Finally, let's keep the first values after any'200' threshold             
data = data.loc[data['test']==True , 'series']

Is it possible to turn this into a function, vectorize, apply, or any other structure other than 'for' loop to make it run almost instantly?

Quang Hoang · Accepted Answer

This is my approach with a while loop:

head = 0
indexes = []
while head < len(data):
    thresh = data['series'].iloc[head] + 200
    indexes.append(head)
    head += 1
    while head < len(data) and data['series'].iloc[head] < thresh:
        head+=1

# output:
data = data.iloc[indexes]

# double check with your approach
set(data.loc[data['test']].index) == set(data.iloc[indexes].index)
# output: True

The above took 984ms while your approach took 56s.

kaya3 · Answer

You can do it with a simple, one-pass algorithm using one loop over the series; no need for vectorisation or anything like that. It takes 33 milliseconds on my machine, so not "instantaneous", but blink and you'll miss it.

def first_after_gap(series, gap=200):
    out = []
    last = float('-inf')
    for x in series:
        if x - last >= gap:
            out.append(x)
            last = x
    return out

Example:

>>> import numpy as np
>>> series = sorted(np.random.uniform(1, 1000000, 100000))
>>> from timeit import timeit
>>> timeit(lambda: first_after_gap(series), number=1)
0.03264855599991279

How would you optimize this short but very slow Python loop?

Tags:

python

loops

pandas

Nahuel Patiño

2 Answers

Quang Hoang

kaya3

Recent Activity

Donate For Us

How would you optimize this short but very slow Python loop?

Tags:

python

loops

pandas

Nahuel Patiño

2 Answers

Quang Hoang

kaya3

Related questions

Recent Activity

Donate For Us