i would like to find what the best way to detect outliers is. here is the problem and some things which probably will not work. let's say we want to fish out some quasi-uniform data from a dirty varchar(50) column in mysql. let's start by doing an analysis by string length.
| strlen | freq |
|---|---|
| 0 | 2312 |
| 3 | 45 |
| 9 | 75 |
| 10 | 15420 |
| 11 | 395 |
| 12 | 114 |
| 19 | 27 |
| 20 | 1170 |
| 21 | 33 |
| 35 | 9 |
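for reference, a length histogram like the one above could be built along these lines once the column values are in memory. this is only a minimal Python sketch: `length_histogram` and the `sample` values are made up for illustration, and counting a NULL (`None`) as length 0 is an assumption about how the omitted data should be treated.

```python
from collections import Counter


def length_histogram(values):
    """Count how often each string length occurs.

    Assumption: a missing value (None) is counted as length 0.
    """
    return Counter(len(v) if v is not None else 0 for v in values)


# Hypothetical sample, not taken from the real column.
sample = ["ABC1234567", "ABC123456", "", None, "ABC1234568"]
hist = length_histogram(sample)

for length, freq in sorted(hist.items()):
    print(length, freq)
```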
what i would like to do is devise an algorithm to determine which string lengths have a high probability of being purposefully distinct rather than being typos or random garbage. this field could effectively be an "enum" type, so there can be several frequency spikes for valid values. clearly 10 and 20 are valid, and 0 is just omitted data. 35 and 3 might be random trash despite being very different in frequency. 19 and 21 might be typos around the 20 format. 11 might be typos for 10, but what about 12?
it seems that simply using occurrence frequency (%) is not enough. there need to be hotspots of higher "just an error" probability around the obvious outliers (the spikes such as 10 and 20).
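one way to picture the "hotspot" idea is a neighborhood rule: a length is treated as a probable typo when a much more frequent length sits within a few characters of it, and only lengths with no such neighbor are judged on their own share. the Python sketch below is a heuristic made up to illustrate that, not an established method. `classify_lengths` and the parameters `radius`, `ratio` and `min_fraction` are all assumptions that would need tuning. the share floor is scaled by the number of distinct lengths, so it loosens as more valid formats appear.

```python
def classify_lengths(freq, radius=2, ratio=10.0, min_fraction=0.25):
    """Label each length as 'peak', 'near_peak' or 'isolated_rare'.

    Assumed heuristic:
    - near_peak: some other length within `radius` characters is at least
      `ratio` times more frequent (a likely typo of that format).
    - peak: not dominated, and its share of all rows is at least
      `min_fraction` of the uniform share 1 / (number of distinct lengths).
    - isolated_rare: not dominated, but too rare to count as a format.
    """
    total = sum(freq.values())
    floor = min_fraction / len(freq)
    labels = {}
    for length, count in freq.items():
        dominated = any(
            other != length
            and abs(other - length) <= radius
            and other_count >= ratio * count
            for other, other_count in freq.items()
        )
        if dominated:
            labels[length] = "near_peak"
        elif count / total >= floor:
            labels[length] = "peak"
        else:
            labels[length] = "isolated_rare"
    return labels


# Frequencies from the table in the question.
freq = {0: 2312, 3: 45, 9: 75, 10: 15420, 11: 395,
        12: 114, 19: 27, 20: 1170, 21: 33, 35: 9}
labels = classify_lengths(freq)

for length in sorted(labels):
    print(length, labels[length])
```

with these particular defaults, 12 lands in the hotspot of 10 only because `radius` is 2. with `radius=1` it would be judged on its own share instead, which is exactly the kind of judgment call the "what about 12?" case needs.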
also, a fixed threshold fails when there are 15 unique lengths which can vary between 5 and 20 chars, each with between 7% and 20% occurrence.
standard deviation will not work because it relies on the mean. median absolute deviation probably won't work because you can have a high-frequency outlier that cannot be discarded.
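to see how both behave on the frequencies from the table, here is a small Python sketch using only the standard `statistics` module. the cutoffs (2 for the z-score, 3.5 for the MAD-based modified z-score with the usual 0.6745 constant) are conventional choices assumed for illustration. note that both scores are computed on the frequency values alone, so neither one knows that 11 sits next to 10.

```python
import statistics

# Frequencies from the table in the question.
freq = {0: 2312, 3: 45, 9: 75, 10: 15420, 11: 395,
        12: 114, 19: 27, 20: 1170, 21: 33, 35: 9}
counts = list(freq.values())

# Mean / standard deviation: the huge count for length 10 inflates both.
mean = statistics.mean(counts)
std = statistics.pstdev(counts)
z_flagged = sorted(
    length for length, c in freq.items() if abs(c - mean) / std > 2
)

# Median absolute deviation (MAD) with the modified z-score.
median = statistics.median(counts)
mad = statistics.median([abs(c - median) for c in counts])
mad_flagged = sorted(
    length for length, c in freq.items()
    if 0.6745 * abs(c - median) / mad > 3.5
)

print("z-score flags:", z_flagged)
print("MAD flags:", mad_flagged)
```

in both cases the lengths that get flagged as "outliers" are high-frequency ones, i.e. values that have to be kept rather than discarded, and the z-score cannot separate 20 from the junk lengths at all.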
yes there will be other params for cleaning the data in the code, but length seems to very quickly pre-filter and classify fields with any amount of structure.
are there any known methods which would work efficiently? i'm not very familiar with Bayesian filters or machine learning but maybe they can help?
thanks! leon
Sounds like anomaly detection is the way to go. Anomaly detection is a kind of machine learning that is used to find outliers. It comes in a couple of varieties, including supervised and unsupervised. In supervised learning, the algorithm is trained using labeled examples of outliers. In unsupervised learning, the algorithm attempts to find outliers without any labeled examples. Here are a couple of links to start out:
http://en.wikipedia.org/wiki/Anomaly_detection
http://s3.amazonaws.com/mlclass-resources/docs/slides/Lecture15.pdf
I didn't find any links to readily available libraries. Something like MATLAB, or its free cousin, Octave, might be a nice way to go if you can't find an anomaly detection library in your language of choice.
https://goker.wordpress.com/tag/anomaly-detection/
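To make the unsupervised variety a bit more concrete, here is a minimal sketch in plain Python of one common formulation: fit a Gaussian to a numeric feature and flag values whose estimated density falls below a cutoff. The feature (number of digits in each length-10 string), the toy data, and the cutoff `epsilon` are all hypothetical, and a single Gaussian is an assumption that would not suit a multi-peaked feature such as the raw string length.

```python
import math
import statistics


def fit_gaussian(xs):
    """Estimate the mean and (population) standard deviation of the feature."""
    return statistics.mean(xs), statistics.pstdev(xs)


def density(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )


def find_anomalies(xs, epsilon):
    """Return the values whose density is below epsilon (an assumed cutoff)."""
    mu, sigma = fit_gaussian(xs)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [x for x in xs if density(x, mu, sigma) < epsilon]


# Hypothetical feature: digits per string among the length-10 values.
digit_counts = [7, 7, 7, 8, 7, 6, 7, 7, 0, 7]
anomalies = find_anomalies(digit_counts, epsilon=0.01)
print(anomalies)
```

No labeled examples of bad rows are needed here, which is what makes it unsupervised. Choosing `epsilon` still takes some known-good and known-bad samples or manual inspection.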