 

How to find values below (or above) average

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.

from io import StringIO  # Python 3; the original post used Python 2's StringIO module

import pandas as pd

myst = """01/01/2016  8781262
01/02/2016  8958598
01/03/2016  8787628
01/04/2016  9770861
01/05/2016  8409410
01/06/2016  8924784
01/07/2016  8597500
01/08/2016  6436862
01/09/2016  1542677
"""
u_cols = ['month', 'count']

# the columns are whitespace-separated, so split on any run of whitespace
df = pd.read_csv(StringIO(myst), sep=r'\s+', names=u_cols)

Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?

This is easy if I define a limit (e.g. 9 or 10%). But I want the script to decide that for me, and to return the values when the difference between the lowest and the second-lowest value is more than 5% overall. In this case the September count should be returned.
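
For reference, a literal reading of that rule might look like the sketch below (interpreting the 5% as 5% of the overall mean is an assumption on my part; the answers below suggest more principled criteria):

# hypothetical sketch of the 5% rule described above; "5%" is interpreted
# here as 5% of the overall mean, which is an assumption
lowest, second_lowest = df['count'].nsmallest(2)
if (second_lowest - lowest) > 0.05 * df['count'].mean():
    print(df[df['count'] == lowest])   # prints the 01/09/2016 row for this data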

asked Feb 06 '23 by shantanuo


2 Answers

A very common approach to filtering outliers is to use the standard deviation. In this case, we will calculate a z-score, which quickly identifies how many standard deviations away from the mean each observation is. We can then filter out those observations that are more than 2 standard deviations away. For normally distributed random variables, this should happen approximately 5% of the time.

Define a zscore function

import numpy as np

def zscore(s):
    # number of standard deviations each value lies from the mean
    return (s - np.mean(s)) / np.std(s)

Apply it to the count column

zscore(df['count'])

0    0.414005
1    0.488906
2    0.416694
3    0.831981
4    0.256946
5    0.474624
6    0.336390
7   -0.576197
8   -2.643349
Name: count, dtype: float64

Notice that the September observation is 2.6 standard deviations away.

Use abs and gt to identify outliers

zscore(df['count']).abs().gt(2)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
Name: count, dtype: bool

Again, September comes back True.

Tie it all together to filter your original dataframe

df[zscore(df['count']).abs().gt(2)]

(the result is a dataframe containing only the 01/09/2016 row)

Filter the other way to keep only the non-outliers

df[zscore(df['count']).abs().le(2)]

(the result is the dataframe with every row except 01/09/2016)

answered Feb 08 '23 by piRSquared


First of all, the "way below or too high" concept you refer to is known as an outlier, and quoting Wikipedia (not the best source):

There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

But on the other hand:

In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.

So in my opinion this boils down to the question of whether it is possible to make assumptions about the nature of your data, so that such decisions can be automated.

STRAIGHTFORWARD APPROACH

If you are lucky enough to have a relatively big sample size, and your different samples aren't correlated, you can apply the central limit theorem, which suggests that your values will be approximately normally distributed (see this for a Python-related explanation).

In this context, you may be able to quickly get the mean value and standard deviation of the given dataset. And by applying the corresponding function (with these two parameters) to each given value, you can calculate its probability of belonging to the "cluster" (see this Stack Overflow post for a possible Python solution).
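
As a rough sketch of that idea (assuming SciPy is available, and using the two-sided tail probability of a normal distribution fitted to the data):

import numpy as np
from scipy import stats

counts = df['count']
z = (counts - counts.mean()) / counts.std(ddof=0)  # standard deviations from the mean
p = 2 * stats.norm.sf(np.abs(z))                   # two-sided tail probability under the fitted normal
print(df[p < 0.05])                                # flag points that are unlikely under that model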

Then you do have to set a lower bound, since this distribution reaches a 0% probability only when a point is infinitely far away from the mean. But the good thing is that (if the assumptions hold) this bound will nicely adapt to each dataset, because of its exponential, normalized nature. Such a bound is typically expressed in sigma units and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, dedicated to the discovery of the Higgs boson, was awarded after a 5-sigma threshold was reached; quoting the link:

High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
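
The correspondence between a sigma bound and a tail probability is easy to compute (one-sided here; conventions differ, so these only roughly match the figures quoted above):

from scipy import stats

for k in (2, 3, 5):
    # one-sided tail probability of a standard normal beyond k sigma
    print(k, stats.norm.sf(k))
# 2 sigma ~ 2.3e-02, 3 sigma ~ 1.3e-03, 5 sigma ~ 2.9e-07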

ALTERNATIVES

If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This approach is a core feature of most machine learning algorithms, which can nicely adapt to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose that can even fit in a small script (the one I know best is TensorFlow from Google).

In this case I would consider two different approaches, depending again on what your data looks like:

  • Supervised learning: if you have a labeled training set at your disposal, i.e. one that states which samples belong and which don't, there are algorithms like the support vector machine that, although lightweight, can adapt amazingly well to highly non-linear boundaries.

  • Unsupervised learning: this is probably what I would try first, when you simply have an unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and it can be heavily tweaked and customized to also account for correlations in an arbitrarily large (even infinite) number of dimensions, thanks to the kernel trick (a minimal sketch follows this list). To understand the motivation and approach of an ML-based anomaly detector, I would suggest taking a look at Andrew Ng's videos on the matter.
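
As a purely illustrative sketch of the unsupervised route (scikit-learn is assumed here, it is not mentioned above; with only nine points this is a toy example), a kernel-based one-class SVM could be used like this:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X = StandardScaler().fit_transform(df[['count']])             # standardize the single feature
clf = OneClassSVM(nu=0.1, kernel='rbf', gamma='scale').fit(X)
print(df[clf.predict(X) == -1])                               # -1 marks the points flagged as outliers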

I hope it helps! Cheers

answered Feb 08 '23 by fr_andres