I am trying to come up with a method to regression test number sequences.
My system under test produces a large set of numbers for each system version (e.g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version, I'd like to find the sequences which are most abnormal.
Example:
"Good" version:
version  width  height  depth
1        123    43      302
2        122    44      304
3        120    46      300
4        124    45      301
"New" version:
5        121    60      305
In this case I would obviously like to find the height sequence, because the value 60 stands out more than the width or depth values do.
My current approach computes the mean and the standard deviation of each sequence over the good versions, and for a new version's number it computes the probability that this number is part of that sequence (based on the known mean and standard deviation). This works … kind of.
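For illustration, a minimal sketch of such a mean/std check in Python, using the height values from the example above:

    import statistics

    good_heights = [43, 44, 46, 45]   # heights of the "good" versions 1-4
    new_height = 60                   # height of the "new" version 5

    mean = statistics.mean(good_heights)   # 44.5
    std = statistics.stdev(good_heights)   # ~1.29

    # z-score: how many standard deviations the new value lies from the mean
    z = abs(new_height - mean) / std
    print(z)   # ~12 -> height clearly stands out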
The numbers in my sequences are not necessarily Gaussian distributed around a mean value, but often are rather constant and only sometimes produce an outlier value, which also seems to be rather constant, e.g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, based only on mean and standard deviation, the value 10 would not be 100% likely to be part of the sequence, and the value 5 would be rather unlikely.
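Concretely, the plain mean/std check from the sketch above gives on this sequence:

    import statistics

    seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
    mean = statistics.mean(seq)   # ~9.23
    std = statistics.stdev(seq)   # ~1.88

    print(abs(5 - mean) / std)    # ~2.25 -> 5 looks unlikely, although it recurs
    print(abs(10 - mean) / std)   # ~0.41 -> even 10 is not a "perfect" fit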
I considered using a histogram approach, but hesitated and decided to ask here first. The problem with a histogram is that I would need to store quite a lot of information for each sequence (in contrast to just a mean and a standard deviation).
The next aspect I considered: I am pretty sure that this kind of task is not new and that there are probably existing solutions which would fit my situation nicely; but my research did not turn up much.
I found a library, PyBrain, which at first glance seems to process number sequences and then apparently tries to analyse them with a simulated neural network. I'm not sure if this would be an approach for me (and again it seems like I would have to store a large amount of data for each number sequence, such as a complete neural network).
So my question is this:
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)
For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.
You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply the Gaussian process to the next example in your sequence.
Since a GP does not only give you an estimate for the target but also a confidence, you could threshold based on the confidence to determine what is an outlier.
To realize this, various toolboxes exist (scikits.learn, shogun, ...), but the easiest to start with is likely GPy. An example of 1D regression that you can tune to your task is nicely described in the following notebook:
http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
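A minimal sketch of what this could look like, assuming GPy and numpy are installed; the data is taken from the question, and the 3-sigma threshold is just one illustrative choice:

    import numpy as np
    import GPy

    X = np.array([[1.], [2.], [3.], [4.]])       # version numbers
    Y = np.array([[43.], [44.], [46.], [45.]])   # e.g. the height sequence

    kernel = GPy.kern.RBF(input_dim=1)
    model = GPy.models.GPRegression(X, Y, kernel)
    model.optimize()

    # predict version 5: the GP returns a mean and a variance, so the
    # observed value can be flagged if it falls outside e.g. 3 sigma
    mean, var = model.predict(np.array([[5.]]))
    mu, sigma = mean[0, 0], np.sqrt(var[0, 0])
    if abs(60. - mu) > 3 * sigma:
        print("height looks abnormal")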
Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)?
The scientific discipline you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, maybe have a look at Wikipedia here (outlier detection) and here (anomaly detection). There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.
You should also think about what is worse in your case: false positives (type I errors) or false negatives (type II errors), as decreasing the rate of one of these error types increases the rate of the other.
EDIT: Given your requirement of multiple peaks in some cases and flat distributions in others, an algorithm like this could work (a Python sketch follows after the note below):
1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)
2.) iterate through the bins: if a single bin contains more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"
3.) increase the bin width by 1 and repeat steps 1 and 2
4.) repeat steps 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"
5.) let the test cases for the bad values fail
This algorithm should work for cases such as:
a single large peak with some outliers
multiple large peaks and some outliers in between
a flat distribution with a concentration in a certain region (or in multiple regions)
a number sequence where all numbers are equal
Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.
Note: to check which bin a value belongs to, you can use the modulo operator (%): with bin width 3 the bin start for a value v is v - (v % 3), i.e. name the bin after the value whose modulo with the bin width is zero, e.g. 477 % 3 = 0 -> put 477, 478, and 479 into bin 477 (while 475 and 476 go into bin 474).
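For illustration, a minimal Python sketch of the steps above. The function name and the parameter names are mine (a and b appear as bin_fraction and good_fraction), and checking a new value by set membership is just one possible realisation of step 5:

    from collections import Counter

    def find_good_values(sequence, bin_fraction=0.10, good_fraction=0.90):
        """Mark values as "good" by iteratively widening histogram bins."""
        good = set()
        bin_width = 1
        total = len(sequence)
        # 4.) repeat until e.g. 90% of the values are marked as "good"
        while sum(1 for v in sequence if v in good) < good_fraction * total:
            # 1.) count occurrences per bin; value v falls into bin v - (v % bin_width)
            bins = Counter(v - (v % bin_width) for v in sequence)
            # 2.) mark all values of a sufficiently full bin as "good values"
            for start, count in bins.items():
                if count > bin_fraction * total:
                    good.update(v for v in sequence if start <= v < start + bin_width)
            # 3.) increase the bin width and try again
            bin_width += 1
        return good

    # 5.) fail the test case for a value that is not among the good values
    seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
    good = find_good_values(seq)
    print(sorted(good))   # [5, 10] -> both recurring values are accepted
    print(60 in good)     # False -> the test for this value should fail

Note that only the good values themselves (or the good bins) need to be stored per sequence, which keeps the stored data small.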