
SWAB segmentation algorithm on time series data

I'm trying to understand how to do segmentation on a set of time series data (daily stock prices, temperatures, etc.) and came across a book that explains the SWAB (sliding-window and bottom-up) segmentation algorithm, but I don't quite understand it. This segmentation is part of a sonification algorithm. The following text is from "Multimedia Data Mining and Analytics: Disruptive Innovation".

The SWAB segmentation algorithm gets four parameters—the input file (time series data), the output file (segmented data), the maximal error, and the indication of nominal attributes. After running a number of experiments on time series of different sizes with different values for the number of segments, we chose the appropriate default number of segments as follows. 25–50 % of time series size for time series with less than 100 observations, 20–35 % for time series with 100–200 observations, and 15–25 % for time series with more than 200 observations. If the user is not interested to use the default value for any reason, he can enter his own number of segments as a parameter to the algorithm. Starting with the default values for the minimum and the maximum error, we run the segmentation algorithm for the first time and get the minimum number of segments for a given time series (the higher the maximum error, the fewer segments will be found). Then we decrease the maximum error (and so increase the number of found segments) trying to narrow the upper and the lower bounds of error by dividing the base by powers of 2 (like in binary search). Every time after running the segmentation algorithm with the current maximal error, we test whether this value gives a better approximation for the optimal number of segments, and so is a better upper or lower bound for the optimal maximum error. If so, we advance the appropriate bound to this value. In the beginning, only the upper bound is affected. However, once we found the lower bound that provides more segments than the optimum, we continue to look for the optimal number of segments by smaller steps: the next maximum error is the mean between the current upper and lower bounds. As follows from our experience with many different time series databases, the optimal maximal error is usually found within 3–4 iterations. The convergence rate depends on the input time series database itself. If the algorithm has not converged within 20 iterations, we stop searching and proceed with the next sonification steps using the segments found at the 20th iteration.
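As far as I can tell, the search over the maximal error boils down to something like the Python sketch below (this is just my reading of the quoted text; run_swab is a placeholder for a single segmentation pass that returns the number of segments it found):

# Sketch of the error-bound search described in the quote above.
# run_swab(max_error) is a hypothetical stand-in for one SWAB pass
# returning the number of segments found for that maximal error.
def find_max_error(run_swab, target_segments, base_error, max_iterations=20):
    upper = base_error  # errors this large give too few segments
    lower = None        # unknown until some error gives too many segments
    max_error = base_error
    for step in range(1, max_iterations + 1):
        n_segments = run_swab(max_error)
        if n_segments == target_segments:
            return max_error
        if n_segments < target_segments:
            upper = max_error  # error too large: a better upper bound
        else:
            lower = max_error  # error too small: a better lower bound
        if lower is None:
            # no lower bound found yet: keep dividing the base by powers of 2
            max_error = base_error / (2 ** step)
        else:
            # both bounds known: bisect between them
            max_error = (lower + upper) / 2
    return max_error  # not converged within max_iterations; use the last value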

So, for example, if I have time series data with 150 observations (which falls in the 20-35% default range for the number of segments), what are the exact steps I need to take to segment the data?

Any help at all is appreciated, thanks.

charliekelly asked Oct 03 '16

1 Answer

Exact steps

Here is a brief description of the methodology, taken from the survey paper linked under Sources below:

The Sliding Window algorithm works by anchoring the left point of a potential segment at the first data point of a time series, then attempting to approximate the data to the right with increasingly longer segments. At some point i, the error for the potential segment is greater than the user-specified threshold, so the subsequence from the anchor to i - 1 is transformed into a segment. The anchor is moved to location i, and the process repeats until the entire time series has been transformed into a piecewise linear approximation.

Based on that, pseudocode for the algorithm is as follows. See my comments in the code for a description of what exactly is going on.

//function takes a set of points T and a max error
function Sliding_Window(T, max_error)
  Seg_TS = empty list of segments;
  anchor = 1;
  while (anchor <= length(T)) { //until the entire series is segmented
    i = 2;

    //keep making subsets of progressively larger size
    //until the error of the subset is greater than the max error
    //or the end of the series is reached
    //T[anchor: anchor + i] represents the elements of the series
    //from index (anchor) to index (anchor + i)
    //this could be implemented as an array slice
    while (anchor + i <= length(T)
           and calculate_error(T[anchor: anchor + i]) < max_error) {
      i = i + 1;
    }

    //add the newly found segment to the set of all segments
    Seg_TS = concat(Seg_TS, create_segment(T[anchor: anchor + (i-1)]));

    //and advance the anchor in preparation for creating a new segment
    anchor = anchor + i;
  }
  return Seg_TS;
}
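If you want something you can actually run, here is a minimal Python translation of that pseudocode. It is only a sketch under a couple of assumptions I have made: points are (x, y) tuples, calculate_error is any function of a list of points (such as the sum-of-squares one shown in the next section), and adjacent segments do not share endpoints.

# Minimal runnable translation of the pseudocode above (a sketch,
# not a reference implementation). T is a list of (x, y) tuples;
# calculate_error takes a list of points and returns a number.
def sliding_window(T, max_error, calculate_error):
    segments = []
    anchor = 0
    n = len(T)
    while anchor < n:
        i = 2
        # grow the candidate segment while it stays within the error
        # bound and there is still data left
        while anchor + i <= n and calculate_error(T[anchor:anchor + i]) < max_error:
            i += 1
        # keep the last candidate that was still within the bound
        # (this can be a single point for very noisy data or at the tail)
        segments.append(T[anchor:anchor + i - 1])
        anchor = anchor + i - 1
    return segments

For example, using the sum_of_squares_error function defined further down:

points = list(enumerate([1.0, 2.0, 3.1, 4.0, 0.5, 0.4, 0.6]))
print(sliding_window(points, max_error=0.1, calculate_error=sum_of_squares_error))
# two segments: the rising run 1.0..4.0 and the flat run 0.5..0.6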

Definition of "Error"

One thing you seem to be unclear on is the meaning of "error" in this context. The following paragraph explains it nicely:

All segmentation algorithms also need some method to evaluate the quality of fit for a potential segment. A measure commonly used in conjunction with linear regression is the sum of squares, or the residual error. This is calculated by taking all the vertical differences between the best-fit line and the actual data points, squaring them and then summing them together. Another commonly used measure of goodness of fit is the distance between the best fit line and the data point furthest away in the vertical direction.

In other words, there is more than one method that can be used to represent "error" here. Two common ones used in statistics are sum of squares and maximum vertical distance. In theory, you could even write your own function for this as long as it returned a number that was in some way indicative of how well the segment represented a given set of points.

More info on the sum of squares method is here: https://en.wikipedia.org/wiki/Residual_sum_of_squares

If you wanted to implement it yourself, some pseudocode might look like this:

//takes the points covered by a candidate segment and the line
//(or other curve) used to approximate them
function calculateSegmentErrorUsingSumOfSquares(points, approximation_segment) {
  sum = 0;
  for each (point in points) {
    //vertical distance between the real point and the approximation
    difference = point.y_coordinate - approximation_segment.y_at_x(point.x_coordinate);
    sum = sum + (difference * difference);
  }
  return sum;
}
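If it helps, here is a runnable Python sketch of both measures mentioned above, fitting an ordinary least-squares line through the points (my own illustration, not code from any of the papers):

# Both error measures from the quoted paragraph, in plain Python.
# points is a list of (x, y) tuples with at least two distinct x values.
def _fit_line(points):
    # ordinary least-squares slope and intercept
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    denom = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / denom
    return slope, mean_y - slope * mean_x

def sum_of_squares_error(points):
    # residual sum of squares against the best-fit line
    slope, intercept = _fit_line(points)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

def max_vertical_distance_error(points):
    # largest vertical distance from any point to the best-fit line
    slope, intercept = _fit_line(points)
    return max(abs(y - (slope * x + intercept)) for x, y in points)

Either function (or your own) can be plugged into sliding_window above as calculate_error.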

Please note that any method you use will have its own advantages and weaknesses, but the key point is this: make sure that whatever error function you choose responds well to the type of data you expect.

Sources

https://www.cs.rutgers.edu/~pazzani/Publications/survey.pdf

nhouser9 answered Nov 18 '22