I have a pandas DataFrame with six columns, and I know there are some outliers in each column. The two lines of code below do pretty much what I want, but they remove outliers from only one column of the DataFrame. What if I want to remove outliers from every column at once?
import numpy as np
import pandas as pd
df = pd.DataFrame({'stlines': np.random.normal(size=533)})
df = df[np.abs(df.stlines - df.stlines.mean()) <= (2 * df.stlines.std())]
What would be an elegant way to do this?
The problem is that the outliers in each column may fall in different rows (records), so dropping every row that contains an outlier can discard a lot of data. I'd suggest replacing the outliers with np.nan instead.
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.normal(size=(20, 8)),
    columns=list('ABCDEFGH')
)
df
A B C D E F G H
0 -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 1.494226 -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 -2.137846 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 1.633570
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 2.222507 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 1.690625 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
pandas.DataFrame.mask
df.mask((df - df.mean()).abs() > 2 * df.std())
A B C D E F G H
0 NaN -1.268466 NaN NaN -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 NaN -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 NaN 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 NaN
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 NaN 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 NaN 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
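To see how the flagged values are spread across rows, it can help to count them per column and per row before deciding whether to drop anything. The following is a minimal sketch under the same two-standard-deviation rule; the generator `rng`, the seed `0`, and the names `is_outlier` and `masked` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative data only: the seed and shape are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(20, 8)), columns=list("ABCDEFGH"))

# True where a value lies more than 2 standard deviations from its column mean.
is_outlier = (df - df.mean()).abs() > 2 * df.std()

# Flagged cells become NaN; the shape of the frame is unchanged.
masked = df.mask(is_outlier)

print(is_outlier.sum())              # number of flagged values per column
print(is_outlier.any(axis=1).sum())  # number of rows with at least one flagged value
```

If the second number is much larger than any single per-column count, dropping whole rows would throw away many values that were never flagged.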
dropna
If you only want the rows that have no outliers in any column, follow the above with dropna:
df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()
A B C D E F G H
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
Assuming you have multiple columns, you can build the condition per column with apply and keep only the rows where all columns satisfy it:
df[df.apply(lambda x: (x - x.mean()).abs() < 2 * x.std()).all(axis=1)]
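The row filter can also be wrapped in a small helper so the threshold is adjustable. This is only a sketch: the function name `drop_outlier_rows` and the parameter `n_std` are hypothetical, and the six-column random frame is made up to resemble the question. It uses `<=`, matching the condition in the question, whereas the one-liner above uses a strict `<`; the two differ only for a value lying exactly on the boundary.

```python
import numpy as np
import pandas as pd


def drop_outlier_rows(df: pd.DataFrame, n_std: float = 2.0) -> pd.DataFrame:
    """Hypothetical helper: keep rows where every column is within
    n_std standard deviations of that column's mean."""
    within = (df - df.mean()).abs() <= n_std * df.std()
    return df[within.all(axis=1)]


# Made-up data with six numeric columns, as described in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(533, 6)), columns=list("abcdef"))

cleaned = drop_outlier_rows(df)
print(len(df), len(cleaned))  # row counts before and after filtering
```

The mean and standard deviation are computed once on the full frame, so a row is judged against the original column statistics rather than statistics that shift as rows are removed.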