I have a python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example, nicely formatted it looks something like this:
------- ------------- ------------ ---------- -------------------
Cluster %Availability Requests/Sec Errors/Sec %Memory_Utilization
------- ------------- ------------ ---------- -------------------
ams-a 98.099 1012 678 91
bos-a 98.099 1111 12 91
bos-b 55.123 1513 576 22
lax-a 99.110 988 10 89
pdx-a 98.123 1121 11 90
ord-b 75.005 1301 123 100
sjc-a 99.020 1000 10 88
...(so on)...
So in list form, it might look like:
[[ams-a,98.099,1012,678,91],[bos-a,98.099,1111,12,91],...]
My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since it's error rate is so high, but the others can be discarded. Depending on the column, since higher is not necessarily worse, nor is lower, I'm trying to figure out the most efficient way to do this. Seems like numpy gets mentioned a lot for this sort of stuff, but not sure where to even start with it (sadly, I'm more sysadmin than statistician...).
Thanks in advance!
The outlier formula designates outliers based on an upper and lower boundary (you can think of these as cutoff points). Any value that is 1.5 x IQR greater than the third quartile is designated as an outlier and any value that is 1.5 x IQR less than the first quartile is also designated as an outlier.
Using the Interquartile Rule to Find OutliersMultiply the interquartile range (IQR) by 1.5 (a constant used to discern outliers). Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier. Subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.
One good way of identifying outliers visually is to make a boxplot (or box-and-whiskers plot), which will show the median, and a couple of quartiles above and below the median, and the points that lie "far" from this box (see Wikipedia entry http://en.wikipedia.org/wiki/Box_plot). In R, there's a boxplot
function to do just that.
One way to discard/identify outliers programmatically is to use the MAD, or Median Absolute Deviation. The MAD is not sensitive to outliers, unlike the standard deviation. I sometimes use a rule of thumb to consider all points that are more than 5*MAD away from the median, to be outliers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With