Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pattern Detection in Time Series Data

I have a data frame representing a time series like for example:

timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...

value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...

The goal is to classify different patterns (which can be at random positions) and label the values. This means to find the patterns:

  1. 3-6-3
  2. 1-3-7
  3. 0

and to extend the data frame to

timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...

value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...

label: c|c|a|a|a|a|a|a|a|a |a |c |c |c |b |b |b |c |c |b |b |b |b |b |b |a |a |a ...

Note that there is no identical length of such a pattern.

The question is what kind of algorithms can be used for this unsupervised learning problem and maybe additionally what libraries/frameworks could be useful to implement such a task.

Thanks in advance!

like image 423
user6188360 Avatar asked Apr 11 '16 13:04

user6188360


People also ask

How do you identify patterns in time series data?

Many methods that recognize patterns in time series do so by first transforming the time series to a more common type of data. Then a classical machine learning algorithm is used to detect and classify the pattern. Visual pattern recognition achieves this by first transforming the data into a picture.

What are the 3 components of the pattern recognition?

The classical model of pattern recognition involves three major operations— representation, feature extraction, and classification.

Which technique are used in pattern recognition?

On the basis of survey, pattern recognition techniques can be categorized into six parts. These include Statistical Techniques, Structural Techniques, Template Matching, Neural Network Approach, Fuzzy Model and Hybrid Models.


1 Answers

My answer deals with question of pattern matching. After successful matching the following algorithm gives the starting and position of a matched sequence as output. You can then use this information for labelling as you described in your question.

I recommend the SPRING algorithm as introduced in the paper:

Stream Monitoring under the Time Warping Distance (Sakurai, Faloutsos, Yamamuro) http://www.cs.cmu.edu/~christos/PUBLICATIONS/ICDE07-spring.pdf

The algorithms core is the DTW Distance (dynamic time warping) and the resulting distance is the DTW distance. The only difference to DTW is optimization as every position of the "stream"(sequence in which your are looking for a match) is a possible starting point for a match - opposed to DTW which computes the total distance matrix for each starting point. You provide a template, a threshold and a stream (the concept is developed for matching from a datastream, yet you can apply the algorithm onto your dataframe by simply looping through it)

Note:

  • Choice of threshold is not trivial and could pose a siginificant challenge - but this would be another question. ( threshold = 0, if you only want to match the exact same sequences, threshold >0, if you want to match similiar sequences as well)
  • If you are looking for several templates/target pattern, then you would have to create several instances of the SPRING class, each for one target pattern.

The following code is my (messy) implementation, which lacks a lot things (such a class definition and so on). Yet it works and should help guiding you to your answer.

I implemented as follows:

#HERE DEFINE 
#1)template consisting of numerical data points 
#2)stream consisting of numerical data points
template = [1, 2, 0, 1, 2]
stream = [1, 1, 0, 1, 2, 3, 1, 0, 1, 2, 1, 1, 1, 2 ,7 ,4 ,5]

#the threshold for the matching process has to be chosen by the user - yet in reality the choice of threshold is a non-trivial problem regarding the quality of the matching process
#Getting Epsilon from the user 
epsilon = input("Please define epsilon: ")
epsilon = float(epsilon)

#SPRING
#1.Requirements
n = len(template)
D_recent = [float("inf")]*(n)
D_now=[0]*(n)
S_recent=[0]*(n)
S_now=[0]*(n)
d_rep=float("inf")
J_s=float("inf")
J_e=float("inf")
check=0

#check/output
matches=[]

#calculation of accumulated distance for each incoming value
def accdist_calc (incoming_value, template,Distance_new, Distance_recent):
    for i in range (len(template)):
        if i == 0:
            Distance_new[i] = abs(incoming_value-template[i])
        else:
            Distance_new[i] = abs(incoming_value-template[i])+min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1])
    return Distance_new

#deduce starting point for each incoming value
def startingpoint_calc (template_length, starting_point_recent, starting_point_new, Distance_new, Distance_recent):
    for i in range (template_length):
            if i == 0:
                #here j+1 instead of j, because of the programm counting from 0 instead of from 1
                starting_point_new[i] = j+1
            else:
                if Distance_new[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_new[i-1]                    
                elif Distance_recent[i] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_recent[i]                    
                elif Distance_recent[i-1] == min(Distance_new[i-1], Distance_recent[i], Distance_recent[i-1]):
                    starting_point_new[i] = starting_point_recent[i-1]                    
    return starting_point_new     

#2.Calculation for each incoming point x.t - simulated here by simply calculating along the given static list
for j in range (len(stream)):

    x = stream[j]    
    accdist_calc (x,template,D_now,D_recent)
    startingpoint_calc (n, S_recent, S_now, D_now, D_recent) 

    #Report any matching subsequence
    if D_now[n-1] <= epsilon:
        if D_now[n-1] <= d_rep:
            d_rep = D_now[n-1]
            J_s = S_now[n-1]            
            J_e = j+1
            print "REPORT: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)              

    #Identify optimal subsequence
    for i in range (n):
        if D_now[i] >= d_rep or S_now[i] > J_e:
            check = check+1
    if check == n:
        print "MATCH: Distance "+str(d_rep)+" with a starting point of "+str(J_s)+" and ending at "+str(J_e)
        matches.append(str(d_rep)+","+str(J_s)+","+str(J_e))
        d_rep = float("inf")
        J_s = float("inf")
        J_e = float("inf")
        check = 0 
    else:
        check = 0

    #define the recently calculated distance vector as "old" distance
    for i in range (n):
        D_recent[i] = D_now[i]
        S_recent[i] = S_now[i] 
like image 95
Nikolas Rieble Avatar answered Sep 19 '22 22:09

Nikolas Rieble