 

Regression Tests on Arbitrary Number Sequences

I am trying to come up with a method to regression test number sequences.

My system under test produces a large number of values for each system version (e.g. height, width, depth, etc.). These values vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version, I'd like to find the sequences which are most abnormal.

Example:

"Good" version:

version    width   height   depth
   1        123      43      302 
   2        122      44      304
   3        120      46      300
   4        124      45      301

"New" version:

   5        121      60      305

In this case I would obviously like to find the height sequence, because the value 60 stands out more than the new width or depth value.

My current approach computes the mean and the standard deviation of each sequence of the good cases and for a new version's number it computes the probability that this number is part of this sequence (based on the known mean and standard deviation). This works … kind of.
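For reference, a minimal sketch of this mean/standard-deviation approach applied to the example table above. The names `good`, `new`, and `z_score` are made up for illustration, and the distance from the mean in units of standard deviations is used as the "unlikeliness" score instead of an explicit probability:

```python
from statistics import mean, stdev

# "Good" versions 1-4 from the example table, one list per sequence.
good = {
    "width": [123, 122, 120, 124],
    "height": [43, 44, 46, 45],
    "depth": [302, 304, 300, 301],
}
# The "new" version 5.
new = {"width": 121, "height": 60, "depth": 305}


def z_score(history, value):
    """Distance of value from the mean of history, in sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: any deviation is maximally surprising.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


# Most abnormal sequence first.
ranking = sorted(
    ((z_score(good[name], new[name]), name) for name in good),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

Only the mean and the standard deviation have to be stored per sequence, which is why this variant is so cheap.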

The numbers in my sequences are not necessarily Gaussian distributed around a mean value. Often they are rather constant and only occasionally produce an outlier value, which itself seems to be rather constant, e.g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, based only on mean and standard deviation, the value 10 would not be rated as 100% likely to belong to the sequence, and the value 5 would be rated as rather unlikely.

I considered using a histogram approach but decided to ask here first. The problem with a histogram is that I would need to store quite a lot of information for each sequence (in contrast to just a mean and a standard deviation).

I am also pretty sure that this kind of task is not new and that there are probably existing solutions which would fit my situation nicely, but my research did not turn up much.

I found libraries such as PyBrain, which at first glance seems to process number sequences and then apparently tries to analyse them with a simulated neural network. I'm not sure whether this would be a suitable approach for me (and again it seems I would have to store a large amount of data for each number sequence, such as a complete neural network).

So my question is this:

Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)

For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.

Alfe asked Feb 17 '17 15:02




2 Answers

You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply the fitted GP to the next example in your sequence.

Since a GP gives you not only an estimate for the target but also a confidence, you could threshold on that confidence to decide what counts as an outlier.

To realize this, various toolboxes exist (scikit-learn, formerly scikits.learn; shogun; ...), but GPy is likely the easiest. An example of 1d regression that you can tune to get your task going is nicely described in the following notebook:

http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
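To make the idea concrete without depending on any particular toolbox, here is a rough pure-Python sketch of GP regression with an RBF kernel, applied to the sequences from the question. It is not the GPy API. The helper names (`solve`, `gp_surprise`) are invented for this sketch, and the hyperparameters (`length=2.0`, `noise_ratio=0.5`, signal variance taken from the sample variance) are fixed by assumption; a real toolbox would fit them to the data.

```python
import math
from statistics import mean, variance


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gp_surprise(xs, ys, x_new, y_new, length=2.0, noise_ratio=0.5):
    """|y_new - predicted mean| divided by the predictive standard deviation."""
    mu = mean(ys)
    var = max(variance(ys), 1e-9)  # floor avoids a zero kernel for constant data

    def k(p, q):
        # RBF (squared exponential) kernel scaled by the sample variance.
        return var * math.exp(-((p - q) ** 2) / (2 * length ** 2))

    n = len(xs)
    # Kernel matrix of the observed versions plus observation noise on the diagonal.
    gram = [
        [k(xs[i], xs[j]) + (noise_ratio * var if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    k_star = [k(x, x_new) for x in xs]
    w = solve(gram, k_star)
    pred_mean = mu + sum(wi * (y - mu) for wi, y in zip(w, ys))
    pred_var = var - sum(wi * ki for wi, ki in zip(w, k_star)) + noise_ratio * var
    return abs(y_new - pred_mean) / math.sqrt(pred_var)


versions = [1, 2, 3, 4]
good = {
    "width": [123, 122, 120, 124],
    "height": [43, 44, 46, 45],
    "depth": [302, 304, 300, 301],
}
new = {"width": 121, "height": 60, "depth": 305}

scores = {name: gp_surprise(versions, good[name], 5, new[name]) for name in good}
outliers = [name for name, score in scores.items() if score > 3.0]  # threshold is arbitrary
print(scores, outliers)
```

Note that the scores depend quite strongly on the assumed noise level and length scale, so the threshold only makes sense after the hyperparameters have been chosen or fitted for your data.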

Soeren Sonnenburg answered Sep 18 '22 18:09


Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)?

The scientific discipline you are looking for is called outlier detection or anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, have a look at the Wikipedia articles on outlier detection and on anomaly detection. There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.

You should also think about what is worse in your case, false positives (type 1 errors) or false negatives (type 2 errors), since for a given method lowering the rate of one of these error types usually raises the rate of the other.

EDIT: given your requirement with multiple peaks in some cases, flat distributions in other cases, an algorithm like this could work:

1.) Count the number of occurrences of each distinct number in your sequence and place the count in the bin that corresponds to that number (initial bin width = 1).

2.) Iterate through the bins: if a single bin holds more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values".

3.) Increase the bin width by 1 and repeat steps 1 and 2.

4.) Repeat steps 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values".

5.) Let the test cases for the remaining (bad) values fail.

This algorithm should work for cases such as:

  • a single large peak with some outliers

  • multiple large peaks and some outliers in between

  • a flat distribution with a concentration in a certain region (or in multiple regions)

  • a number sequence where all numbers are equal

Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.

Note: to determine which bin a value belongs to, you can use the modulo operator (%). For example, with a bin size of 3 and the values 475, 476, 477, 478, 479, name each bin after the value whose remainder modulo the bin size is zero: 477 % 3 = 0, so 477, 478, and 479 go into bin 477 (and 475 and 476 go into bin 474).
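The steps above could be sketched roughly as follows, assuming integer values and a non-empty history. The function names `good_ranges` and `is_normal` are made up for this sketch, and storing each accepted bin as an inclusive `(low, high)` range, so that a new value can be checked later, is one possible reading of step 5 rather than part of the description above.

```python
from collections import Counter


def is_normal(value, ranges):
    """True if value falls into any accepted (low, high) range."""
    return any(low <= value <= high for low, high in ranges)


def good_ranges(seq, a=0.10, b=0.90):
    """Grow the bin width until at least a fraction b of seq lies in accepted bins."""
    total = len(seq)
    max_width = max(seq) - min(seq) + 1  # wider bins than the data span are pointless
    ranges = set()
    for width in range(1, max_width + 1):
        # Step 1: name each bin after its lowest value (value minus value % width).
        counts = Counter(v - v % width for v in seq)
        # Step 2: accept bins that hold more than a fraction a of all values.
        for start, count in counts.items():
            if count > a * total:
                ranges.add((start, start + width - 1))
        # Step 4: stop once enough of the sequence is covered.
        covered = sum(1 for v in seq if is_normal(v, ranges))
        if covered >= b * total:
            break
    return sorted(ranges)


history = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
ranges = good_ranges(history)
print(ranges)
print(is_normal(10, ranges), is_normal(5, ranges), is_normal(7, ranges))
```

For this history both 10 and the recurring 5 exceed the 10% share at bin width 1, so both are accepted, while a new value such as 7 would make the test fail. Only the accepted ranges need to be stored per sequence, which is usually far less than a full histogram.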

user7291698 answered Sep 17 '22 18:09