
How to vectorize this peak finding for loop in Python?

Basically, I'm writing a peak-finding function that needs to be able to beat scipy.argrelextrema in benchmarks. Here is a link to the data I'm using, and the code:

https://drive.google.com/open?id=1U-_xQRWPoyUXhQUhFgnM3ByGw-1VImKB

If this link expires, the data can be found at Dukascopy Bank's online historical data downloader.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('EUR_USD.csv')
data.columns = ['Date', 'open', 'high', 'low', 'close','volume']

data.Date = pd.to_datetime(data.Date, format='%d.%m.%Y %H:%M:%S.%f')

data = data.set_index(data.Date)

data = data[['open', 'high', 'low', 'close']]

data = data.drop_duplicates(keep=False)

price = data.close.values

def fft_detect(price, p=0.4):

    # Low-pass filter: keep only the first fraction p of the rFFT coefficients
    trans = np.fft.rfft(price)
    trans[round(p*len(trans)):] = 0
    inv = np.fft.irfft(trans)

    # Sign changes in the gradient of the smoothed series mark its extrema
    dy = np.gradient(inv)
    peaks_idx = np.where(np.diff(np.sign(dy)) == -2)[0] + 1
    valleys_idx = np.where(np.diff(np.sign(dy)) == 2)[0] + 1

    # Merge peaks and valleys into one sorted list of pattern indices
    patt_idx = list(peaks_idx) + list(valleys_idx)
    patt_idx.sort()

    # Nonzero sign changes: -2 marks a peak, +2 marks a valley
    label = [x for x in np.diff(np.sign(dy)) if x != 0]

    # Look for Better Peaks

    l = 2  # controls the search-window width: indices from -(l+1) to +l around each point

    new_inds = []

    for i in range(0, len(patt_idx[:-1])):

        # Window of raw-price indices around the smoothed extremum
        search = np.arange(patt_idx[i]-(l+1), patt_idx[i]+(l+1))

        # Peaks (-2) take the window maximum, valleys (+2) the minimum
        if label[i] == -2:
            idx = price[search].argmax()
        elif label[i] == 2:
            idx = price[search].argmin()

        new_max = search[idx]
        new_inds.append(new_max)

    plt.plot(price)
    plt.plot(inv)
    plt.scatter(patt_idx,price[patt_idx])
    plt.scatter(new_inds,price[new_inds],c='g')
    plt.show()

    return peaks_idx, price[peaks_idx]

It basically smooths the data using a fast Fourier transform (FFT), then takes the derivative to find the minimum and maximum indices of the smoothed data, and then finds the corresponding peaks in the unsmoothed data. Sometimes the peaks it finds are not ideal due to smoothing effects, so I run this for loop to search for a higher or lower point around each index, within the bounds specified by l. I need help vectorizing this for loop! I have no idea how to do it. Without the for loop, my code is about 50% faster than scipy.argrelextrema, but the for loop slows it down. So if I can find a way to vectorize it, it would be a very quick and very effective alternative to scipy.argrelextrema. These two images show the results without and with the for loop, respectively.

(Images: before the for loop was added, the peaks are not ideal; with the for loop, the peaks are much better.)
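For reference, a minimal timing sketch for this kind of comparison (assumed setup: price and fft_detect defined as above, with the plotting calls commented out; order=5 for argrelextrema is an arbitrary choice, not the exact benchmark):

from timeit import timeit

import numpy as np
from scipy.signal import argrelextrema

# Rough timing sketch -- assumes `price` and `fft_detect` exist as in the question,
# with the plotting calls inside fft_detect commented out so they don't dominate the timing.
t_fft = timeit(lambda: fft_detect(price, p=0.4), number=10)
t_scipy = timeit(lambda: argrelextrema(price, np.greater, order=5), number=10)

print('fft_detect:   ', t_fft / 10, 's per call')
print('argrelextrema:', t_scipy / 10, 's per call')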

asked Aug 26 '18 by ddm-j

2 Answers

This may do it. It's not perfect, but hopefully it gets you what you want and shows you a bit of how to vectorize. Happy to hear any improvements you think up.

label = np.array(label[:-1]) # label has one entry per extremum but the loop only covers patt_idx[:-1], so trim the last entry

# the idea is to make the index matrix you're for-looping over row by row all in one go.
# This part is sloppy and you can improve this generation.

search = np.vstack([np.arange(patt_idx[i]-(l+1), patt_idx[i]+(l+1)) for i in range(len(patt_idx)-1)]) # pass a list, not a generator, so newer numpy versions accept it; you can refine this

# then you can make the price matrix (fancy indexing returns a copy, so the original price array is untouched)

price_win = price[search]

# and you can flip the sign of the peak rows so you only need argmin instead of both argmin and argmax

price_win[label==-2] = -price_win[label==-2]

# now find the indices of the minimum price on each row

idx = np.argmin(price_win, axis=1)

# and then extract the refined indices from the search matrix.
# search[:,idx] doesn't work here because it takes every row for each column index (an n x n result);
# pairing each row number with its own idx picks exactly one element per row.

new_inds = search[np.arange(idx.shape[0]), idx]

I find that this reproduces your result but I did not time it. I suspect the search generation is quite slow but probably still faster than your for loop.

Edit:

Here's a better way to produce search:

patt_idx = np.array(patt_idx)
starts = patt_idx[:-1]-(l+1)
stops = patt_idx[:-1]+(l+1)
ds = stops-starts        # window widths (all equal to 2*(l+1))
s0 = stops.shape[0]      # number of windows
s1 = ds[0]               # width of each window
search = np.reshape(np.repeat(stops - ds.cumsum(), ds) + np.arange(ds.sum()),(s0,s1))
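Since every window here has the same width, 2*(l+1), the same search matrix can also be built with one broadcasted addition (a sketch I haven't timed; it assumes patt_idx and l as above):

# Broadcasted alternative (untimed sketch): one row of relative offsets added to each pattern index.
patt_idx = np.asarray(patt_idx)
search = patt_idx[:-1, None] + np.arange(-(l + 1), l + 1)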
answered by kevinkayaks

Here is an alternative... it uses a list comprehension, which is generally a bit faster than an explicit for loop.

l = 2

# Define the bounds beforehand; it's marginally faster than doing it in the loop
upper = np.array(patt_idx) + l + 1
lower = np.array(patt_idx) - l - 1

# List comprehension...
new_inds = [price[low:hi].argmax() + low if lab == -2 else 
            price[low:hi].argmin() + low 
            for low, hi, lab in zip(lower, upper, label)]

# Prices at the refined indices (both peaks and valleys)
new_max = price[new_inds]
new_global_max = np.max(new_max)   # overall maximum across the refined points
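A rough way to check the speed difference (a sketch; it assumes price, patt_idx, label, l, lower and upper are already defined as above, and number=100 is an arbitrary choice):

from timeit import timeit

def loop_version():
    # Mirrors the original for loop from the question
    out = []
    for i in range(len(patt_idx) - 1):
        search = np.arange(patt_idx[i] - (l + 1), patt_idx[i] + (l + 1))
        idx = price[search].argmax() if label[i] == -2 else price[search].argmin()
        out.append(search[idx])
    return out

def comprehension_version():
    # The list comprehension above
    return [price[low:hi].argmax() + low if lab == -2 else
            price[low:hi].argmin() + low
            for low, hi, lab in zip(lower, upper, label)]

print('for loop:          ', timeit(loop_version, number=100))
print('list comprehension:', timeit(comprehension_version, number=100))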
answered by mortysporty