Outliers using RPCA

I read about using RPCA to find outliers in time series data, and I have a basic understanding of the theory behind RPCA. I ran a Python library that implements RPCA and got two matrices as output: L, a low-rank approximation of the input data, and S, a sparse matrix.

Input data (each row is a day; the 10 columns are features):

DAY 1 - 100,300,345,126,289,387,278,433,189,153  
DAY 2 - 300,647,245,426,889,987,278,133,295,153  
DAY 3 - 200,747,145,226,489,287,378,1033,295,453

Output obtained:

L  
[[ 125.20560531  292.91525518   92.76132814  141.33797061  282.93586313
   185.71134917  199.48789246   96.04089205  192.11501055  118.68811072]  
 [ 174.72737183  408.77013914  129.45061871  197.24046765  394.84366245
   259.16456278  278.39005349  134.0273274   268.1010231   165.63205458]  
 [ 194.38951303  454.76920678  144.01774873  219.43601655  439.27557808
   288.32845493  309.71739782  149.10947628  298.27053871  184.27069609]]

S  
[[ -25.20560531    0.          252.23867186   -0.            0.
   201.28865083   78.51210754  336.95910795   -0.           34.31188928]  
 [ 125.27262817  238.22986086  115.54938129  228.75953235  494.15633755
   727.83543722   -0.           -0.           26.8989769    -0.        ]  
 [   0.          292.23079322   -0.            0.           49.72442192
    -0.           68.28260218  883.89052372    0.          268.72930391]]

Inference: (My question)

Now, how do I infer which points could be classified as outliers? For example, looking at the data, 1033 looks like an outlier. The corresponding entry in the S matrix is 883.89052372, which is larger than the other entries in S. Could a fixed threshold on how far the S matrix entries deviate from the corresponding original values in the input matrix be used to decide that a point is an outlier? Or am I misunderstanding the concept of RPCA completely? TIA for your help.

893

asked Dec 22 '16 19:12

Aragorn

1 Answer

You understood the concept of robust PCA (RPCA) correctly: the sparse matrix S contains the outliers. However, S will often contain many non-zero values that you would not classify as anomalies yourself. As you suggest, it is therefore a good idea to filter these points out.

Applying a fixed threshold to identify relevant outliers could work for a single dataset. However, reusing the same threshold across many datasets may give poor results if the mean and variance of the underlying distribution change.

Ideally, you calculate an anomaly score and then classify the outliers based on that score. A simple method, often used in outlier detection, is to check whether a data point (a potential outlier) lies in the tail of your assumed distribution. For example, if you assume the distribution is Gaussian, you can calculate the Z-score (z):

z = (x-μ)/σ,

where μ is the mean and σ is the standard deviation.

You can then apply a threshold to the calculated Z-score to identify outliers. For example, if z > 3 for a given observation, the data point is classified as an outlier: it lies more than 3 standard deviations above the mean, in roughly the upper 0.13% tail of a Gaussian distribution. This approach is more robust to changes in the data than thresholding the non-standardized values. Furthermore, tuning the z value at which you classify a point as an outlier is simpler than finding a raw-scale threshold (883.89052372 in your case) for each dataset.
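As a rough sketch of the Z-score approach described above, the snippet below standardizes the entries of the S matrix from the question and flags those with z > 3. It makes some assumptions that are not in the original posts: all entries of S are pooled into one distribution (rather than scoring each feature column separately), the population standard deviation is used, and the helper name `zscore_outliers` is hypothetical.

```python
import numpy as np

# Input data from the question (3 days x 10 features).
X = np.array([
    [100, 300, 345, 126, 289, 387, 278, 433, 189, 153],
    [300, 647, 245, 426, 889, 987, 278, 133, 295, 153],
    [200, 747, 145, 226, 489, 287, 378, 1033, 295, 453],
])

# Sparse matrix S as returned by the RPCA library in the question.
S = np.array([
    [-25.20560531, 0.0, 252.23867186, 0.0, 0.0,
     201.28865083, 78.51210754, 336.95910795, 0.0, 34.31188928],
    [125.27262817, 238.22986086, 115.54938129, 228.75953235, 494.15633755,
     727.83543722, 0.0, 0.0, 26.8989769, 0.0],
    [0.0, 292.23079322, 0.0, 0.0, 49.72442192,
     0.0, 68.28260218, 883.89052372, 0.0, 268.72930391],
])


def zscore_outliers(sparse, threshold=3.0):
    """Hypothetical helper: flag entries of `sparse` whose Z-score exceeds `threshold`.

    Assumption: every entry of the matrix is treated as a draw from the same
    (roughly Gaussian) distribution, so one mean and one standard deviation
    are computed over the whole matrix.
    """
    mu = sparse.mean()
    sigma = sparse.std()  # population standard deviation (ddof=0)
    z = (sparse - mu) / sigma
    rows, cols = np.where(z > threshold)
    flagged = [(int(r), int(c)) for r, c in zip(rows, cols)]
    return flagged, z


outliers, z = zscore_outliers(S)
for day, feature in outliers:
    # Indices are zero-based, so add 1 to match the DAY labels in the question.
    print(f"DAY {day + 1}, feature {feature + 1}: "
          f"value={X[day, feature]}, S={S[day, feature]:.2f}, z={z[day, feature]:.2f}")
```

On this small example, only the S entry corresponding to 1033 exceeds the threshold. With only 30 values the Gaussian assumption is weak, so the threshold should be treated as a tuning parameter rather than a strict probability statement.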

150

answered Oct 01 '22 15:10

O. Gindele

Tags: python, machine-learning, statistics, outliers, pca
