Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detecting patterns from two arrays of data in Python

I'm trying to detect patterns from open-high-low-close (OHLC) data, so here is what I did:

  1. Find local minima and maxima on the dataset
  2. Normalize my data by converting the array of local minima and maxima to an array of numbers, where every number is the variation from the previous point.

Until now, everything works, but I got stuck on the following part. I defined an array of data, which is a pattern, that when is plotted on a chart will have a certain shape. I'm now trying to find, on other datasets, shapes that are similar to the pattern I specified.

Here is the pattern specified by me:

Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]

And here is a sample dataset:

SampleTarget = [-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067]

I'm looking for a way to detect when, at a certain point, on SampleTarget, is spotted a series of values that are similar to Pattern.

In this case, for example, I need to detect, somehow, that there is a part of SampleTarget where the values are similar to Pattern, since it's the same dataset from which i extracted Pattern.

What I tried:

I've been suggested to use numpy.correlate, python-dtw (Dynamic time warping), or stumpy but the problem I encountered with those is the lack of practical examples on this particular matter.

like image 217
Jack022 Avatar asked Sep 13 '25 11:09

Jack022


2 Answers

Here is a trick to do it:

import numpy as np
pat = np.array(Pattern)
data = np.array(SampleTarget)
n = len(data)
m = len(pat)
k = data.strides[0] # typically 8 for float64

# data2d is a view to the original data,
# with data_2d[:-m, 6] == data_2d[1:1-m, 5] == ... == data_2d[6:, 0]
data_2d = np.lib.stride_tricks.as_strided(data, shape=(n-m+1, m), strides=(k, k))

# So you can check for matches on data[i, :] for all i
print(np.all(np.isclose(data_2d, pat), axis=1))

Output:

array([False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False])

You can use np.where or np.argwhere to get the index of the match(es). You can tune the atol and rtol parameters of np.isclose to set the threshold for an approximate match.

Clarification: if you do the as_strided trick on data=np.arange(30), then data2d will be:

array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 1,  2,  3,  4,  5,  6,  7],
       [ 2,  3,  4,  5,  6,  7,  8],
       ...
       [21, 22, 23, 24, 25, 26, 27],
       [22, 23, 24, 25, 26, 27, 28],
       [23, 24, 25, 26, 27, 28, 29]])

EDIT: This is an efficient way to create a view of the same data with a sliding windows, without requiring extra memory. A numpy array lookup a[i, j] finds the memory address as start_address + a.strides[0]*i + a.strides[1]*j; by setting strides to (8, 8), where 8 is the size of a float value, you achieve the sliding-window effect. Because different array elements refer to the same memory, it's best to treat an array constructed this way as read-only.

EDIT: if you want to have a "score" metric for the quality of the match, you can for example do this:

>>> np.linalg.norm(data_2d - pat, axis=1) 

array([17.5, 17.4, 13.3, 20.5, 12.9, 14.9, 19.7,  0. , 17.4, 13.8, 16.9,
       13.7, 19. , 10.3, 18.3, 15.2, 10.9, 22.3, 13. , 21.8, 15.2, 24.5,
       14.9, 20.7])
# (numbers rounded to reduce clutter)

closer to zero means a better match. Here, norm takes the length of the difference vector d=data-pat, i.e., sqrt(d[0]**2 + ... + d[m-1]**2).

EDIT: If you are interested in patterns that have the same shape, but are scaled to a larger or smaller value, you can do this:

# New dataset with two occurrences of the pattern: one scaled by a factor 1.1,
# one scaled 0.5 with a bit of noise added
data_mod = data*1.1
np.random.seed(1)
data_mod[16:16+m] = pat*0.5 + np.random.uniform(-0.5, 0.5, size=m)
data_2d_mod = np.lib.stride_tricks.as_strided(
    data_mod, shape=(n-m+1, m), strides=(k, k))

# pat_inv: pseudoinverse of pat vector
pat_inv = 1/(pat @ pat) * pat 

# cofs: fit coefficients, shape (n1,)
cofs = data_2d_mod @ pat_inv # fit coefficients, shape (n1,)

# sum of squared residuals, shape (n1,) - zero means perfect fit
ssqr = ((data_2d_mod - cofs.reshape(-1, 1) * pat)**2).sum(axis=1)

print(f'cofs:\n{np.around(cofs, 2)}')
print(f'ssqr:\n{np.around(ssqr, 1)}')

Result:

cofs:
[-0.38 -0.14  0.4  -0.54  0.59  0.36 -0.48  1.1  -0.33  0.12 -0.06  0.18
 -0.21  0.23  0.22 -0.33  0.52 -0.2   0.22 -0.35  0.6  -0.91  0.92  0.01]
ssqr:
[ 81.6 161.8 147.4 155.1 167.3 196.1 138.6   0.   97.8 103.5  85.9  59.3
  57.1  54.9  58.3  29.2   0.7 198.7 217.4 201.9 266.3 235.1 242.8 361.9]

You see that cofs[7] == 1.1, meaning that the pattern had to be scaled by a factor 1.1 on the corresponding data window for a best fit. The fit was perfect, which you can see from ssqr[7] == 0. It also finds the other one, with cofs[16] == 0.52 (close to the expected 0.5 value) and ssqr[16] == 0.7.

Other example: cofs[21]==-0.91 and ssqr[12]==235.1. This means that data_mod[12:19] somewhat resembles the pattern, but inverted (positive and negative swapped). It depends on what you want to do with the data; most likely you'd like to look at cofs values in the range 0.5 to 2: your search pattern is allowed to occur in the data a factor 2 larger or smaller. This should be combined with sufficiently small ssqr values.

Here you see the three potential matches in a graph:

data with pattern matches

If you use ssqr as a score metric, be aware that a series of zeros in the input will result in cofs=0 and ssqr=0.

Consider using np.sqrt(ssqr/m)/np.abs(cofs) as a metric instead, for two reasons. (1) it will match according to relative error and result in NaN values in the case of zero input. (2) it is more intuitive; if the value is 0.5, it means that the data points deviate by about 0.5 from the pattern values. Here is are the values for this metric, using the same example data:

[ 9.1  35.3  11.6  8.8   8.3  14.8   9.4   0.  11.4  33.3 55.9  16.4
 13.9  12.1  12.9  6.2   0.6  27.2  25.4 15.2  10.4  6.4   6.4 482.5]

For the match at data_mod[21:28], the difference metric is 6.4, which corresponds roughly to the differences as seen in the plot.

like image 165
Han-Kwang Nienhuys Avatar answered Sep 15 '25 02:09

Han-Kwang Nienhuys


To find a known pattern, Q, from an independent time series, T, with the STUMPY Python package you'll need to do something like this:

from stumpy.core import mass
import numpy as np

Pattern = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172])

SampleTarget = np.array([-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067])

distance_profile = mass(Pattern, SampleTarget)

# Output of `distance_profile`
array([4.55219811, 4.21544139, 3.29336127, 4.72614564, 2.94202855,
       3.33790488, 4.62672866, 0.        , 4.51937582, 3.47144433,
       4.17966567, 3.26871969, 4.72146046, 2.53070957, 4.46398626,
       3.64503919, 2.64282983, 4.81577841, 2.69799924, 4.64286098,
       2.67446216, 4.52739326, 2.54663088, 3.79885921])

Essentially, the mass function computes a distance_profile by taking your Pattern and sliding a window (that is the same length as your Pattern) along your SampleTarget and calculating the z-normalized Euclidean distance. Each "windowis referred to as a subsequence and each element of thedistance_profilecorresponds to the distance between one subsequence and yourPattern`.

So, for instance, the distance between your Pattern and the first subsequence, SampleTarget[0:0+len(Pattern)], is distance_profile[0] = 4.55219811.

Similarly, the distance between your Pattern and the first subsequence, SampleTarget[1:1+len(Pattern)], is distance_profile[1] = 4.21544139.

And, generally, the distance between your Pattern and the ith subsequence, SampleTarget[i:i+len(Pattern)], is distance_profile[i].

Now, to find the parts of SampleTarget that are "closest" to Pattern, you can look for the smallest values in your distance_profile and then use the corresponding index from your distance_profile to cross reference the index from your SampleTarget.

More concretely, using our example from above, the smallest value found in distance_profile is 0 (a perfect match) and this is found at index i = 7. So, now you should find that SampleTarget[7:7+len(Pattern)] should be identical to Pattern. Note that STUMPY (and mass) doesn't care whether or not an identical match exists. What you'll likely want to do is decide on a reasonable distance threshold/cutoff and examine all "matches" that fall below this distance threshold. Anecdotally/statically, I recommend choosing a threshold that is below np.mean(distance_profile) - 2 * np.std(distance_profile) as a reasonably informed starting point.

Finally, one final note that the mass function computes the sliding window distances in O(nlogn) (the log is base 2) while a naive sliding window computes the distance profile in O(nm) (where m is the length of your pattern). So, for m > 20, mass will always be faster but the performance difference is essentially imperceptible for shorter patterns. And in case anybody wants to debate this, please keep in mind that mass is JIT-compiled and so the first time the function is called it will be "slow" due to the fact that the function needs to be compiled but it should be very fast thereafter.

like image 27
slaw Avatar answered Sep 15 '25 02:09

slaw