Finding outliers in a data set

Tags: python, statistics

I have a Python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example, nicely formatted, it looks something like this:

-------  -------------  ------------  ----------  -------------------
Cluster  %Availability  Requests/Sec  Errors/Sec  %Memory_Utilization
-------  -------------  ------------  ----------  -------------------
ams-a    98.099          1012         678          91
bos-a    98.099          1111         12           91
bos-b    55.123          1513         576          22
lax-a    99.110          988          10           89
pdx-a    98.123          1121         11           90
ord-b    75.005          1301         123          100
sjc-a    99.020          1000         10           88
...(so on)...

So in list form, it might look like:

[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]

My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Higher is not necessarily worse, nor is lower; it depends on the column, so I'm trying to figure out the most efficient way to do this. It seems like numpy gets mentioned a lot for this sort of thing, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician...).

Thanks in advance!

416

asked Jan 05 '11 16:01

septagram

1 Answer

One good way to identify outliers visually is to make a boxplot (or box-and-whiskers plot), which shows the median, the quartiles on either side of it (the edges of the box), and any points that lie "far" from this box (see the Wikipedia entry: http://en.wikipedia.org/wiki/Box_plot). In R, the boxplot function does just that.
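The boxplot idea can also be approximated numerically in Python. The following is only a sketch under stated assumptions: it takes "far" to mean more than 1.5 times the interquartile range beyond the box (a common boxplot convention, not something stated above), and boxplot_outliers is a hypothetical helper name, not a library function.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the box-plot whisker fences.

    Assumption: "far" means more than k * IQR beyond the quartiles,
    with k = 1.5 as in the usual box-plot convention.
    """
    # Quartiles of the data; the middle cut point (the median) is unused here.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# %Availability column from the question
availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]
far_points = boxplot_outliers(availability)
print(far_points)
```

On this small sample only 55.123 (bos-b) falls outside the fences; 75.005 (ord-b) stays inside, because several bad points in a short column stretch the box itself. If matplotlib is installed, matplotlib.pyplot.boxplot draws the corresponding picture.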

One way to identify (or discard) outliers programmatically is to use the MAD, or Median Absolute Deviation: the median of the absolute deviations of each point from the data's median. Unlike the standard deviation, the MAD is not sensitive to outliers. As a rule of thumb, I sometimes treat all points that lie more than 5*MAD away from the median as outliers.
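To illustrate the MAD rule on the data from the question, here is a minimal sketch using only Python's standard library. The names mad_outliers and flag_outliers are hypothetical helpers, the threshold of 5 is the rule of thumb above, and the guard for a MAD of zero is an added assumption for columns where most values are identical.

```python
from statistics import median

def mad_outliers(values, threshold=5.0):
    """Return indices of values more than threshold * MAD from the median."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # Assumption: with a zero MAD the rule cannot be scaled, so flag nothing.
        return []
    return [i for i, d in enumerate(abs_dev) if d > threshold * mad]

def flag_outliers(rows, columns, threshold=5.0):
    """Map each cluster name to the list of columns where it is an outlier."""
    flagged = {}
    # Column 0 of each row is the cluster name, so stats start at index 1.
    for col_idx, name in enumerate(columns, start=1):
        values = [row[col_idx] for row in rows]
        for i in mad_outliers(values, threshold):
            flagged.setdefault(rows[i][0], []).append(name)
    return flagged

rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]
columns = ["%Availability", "Requests/Sec", "Errors/Sec", "%Memory_Utilization"]

flagged = flag_outliers(rows, columns)
for cluster, cols in flagged.items():
    print(cluster, cols)
```

On the seven sample rows this flags bos-b and ord-b for availability, errors, and memory utilization, and ams-a for errors only. The rule is symmetric, so for columns where only one direction is bad you would still need to check which side of the median a flagged value falls on.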

192

answered Sep 21 '22 06:09

Prasad Chalasani
