Remove Outliers in Pandas DataFrame using Percentiles [duplicate]

I have a DataFrame df with 40 columns and many records.

df:

User_id | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |...| Col39

For each column except the user_id column I want to check for outliers and remove the whole record, if an outlier appears.

For outlier detection on each column I decided to simply use the 5th and 95th percentiles (I know it's not the best statistical way):

The code I have so far:

P = np.percentile(df.Col1, [5, 95])
new_df = df[(df.Col1 > P[0]) & (df.Col1 < P[1])]

Question: How can I apply this approach to all columns (except User_id) without doing this by hand? My goal is to get a dataframe without records that had outliers.

Thank you!

367

asked Mar 06 '16 14:03

Mi Funk

2 Answers

Use this code and don't waste your time:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]

In case you want specific columns:

cols = ['col_1', 'col_2']  # one or more

Q1 = df[cols].quantile(0.25)
Q3 = df[cols].quantile(0.75)
IQR = Q3 - Q1

df = df[~((df[cols] < (Q1 - 1.5 * IQR)) | (df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
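
For anyone who wants to try the IQR filter on something small first, here is a minimal sketch. The function name drop_iqr_outliers, the parameter k, and the demo data are my own illustration, not part of the answer above. Note that this is the 1.5 * IQR rule, not the 5th/95th percentile rule from the question.

```python
import pandas as pd

def drop_iqr_outliers(df, cols, k=1.5):
    """Keep rows whose values in `cols` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[cols].quantile(0.25)
    q3 = df[cols].quantile(0.75)
    iqr = q3 - q1
    # The bounds are Series indexed by column name, so the comparison
    # aligns each column with its own bounds.
    outside = (df[cols] < (q1 - k * iqr)) | (df[cols] > (q3 + k * iqr))
    return df[~outside.any(axis=1)]

# Hypothetical demo data: the last row has an obvious outlier in Col1.
demo = pd.DataFrame({
    "User_id": [1, 2, 3, 4, 5, 6],
    "Col1": [10, 12, 11, 13, 12, 500],
    "Col2": [5, 6, 5, 7, 6, 6],
})

cleaned = drop_iqr_outliers(demo, ["Col1", "Col2"])
print(cleaned)
```

Passing only the value columns keeps User_id out of the outlier check while still returning it in the result.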

134

answered Oct 11 '22 10:10

E.Zolduoarrati

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First removing the User_id column

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

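Next, filtering values based on the computed percentiles. To do that I use an apply by columns, and that's it! Values outside the bounds are dropped from each column, so they come back as NaN once the columns are realigned.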

filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[low,x.name]) & 
                                    (x < quant_df.loc[high,x.name])], axis=0)
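
As a possible alternative to the apply call, the same masking can probably be written with a boolean mask and DataFrame.where. This is only a sketch on a tiny made-up frame (the values below are hypothetical), reusing the names filt_df, quant_df, low and high from above.

```python
import pandas as pd

# Hypothetical stand-in for the filtered frame (User_id already removed).
filt_df = pd.DataFrame({
    "Col0": [1, 20, 30, 40, 99],
    "Col1": [50, 2, 60, 98, 70],
})

low, high = 0.05, 0.95
quant_df = filt_df.quantile([low, high])

# quant_df.loc[low] is a Series indexed by column name, so each column
# is compared against its own lower and upper bound.
inside = (filt_df > quant_df.loc[low]) & (filt_df < quant_df.loc[high])

# Keep values inside the bounds, replace everything else with NaN.
masked = filt_df.where(inside)
print(masked)
```

Rows that still contain a NaN after this step are the ones holding at least one outlier.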

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

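Checking the result (these outputs were taken before the dropna call above, which is why NaN values and differing counts still appear)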

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

How to generate the test dataset

import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

# One User_id column plus five value columns of random integers
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)
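
Putting the steps together, a compact version might look like the sketch below. The helper name drop_percentile_outliers and its parameters are my own, and the random data only mimics the test dataset, so the exact rows kept will differ from the outputs shown above.

```python
import numpy as np
import pandas as pd

def drop_percentile_outliers(df, id_col="User_id", low=0.05, high=0.95):
    """Drop every row with a value at or outside the low/high quantile of its column."""
    value_cols = df.columns.difference([id_col])
    bounds = df[value_cols].quantile([low, high])
    # Strict inequalities, as in the question's single-column code.
    inside = (df[value_cols] > bounds.loc[low]) & (df[value_cols] < bounds.loc[high])
    return df[inside.all(axis=1)]

# Assumed test data: 100 rows of random integers in [0, 100).
rng = np.random.RandomState(0)
df = pd.DataFrame({"User_id": rng.randint(0, 100, 100)})
for i in range(5):
    df["Col" + str(i)] = rng.randint(0, 100, 100)

clean_df = drop_percentile_outliers(df)
print(len(df), len(clean_df))
```

Selecting rows with a boolean mask avoids the NaN round trip, so the integer dtypes are preserved and no dropna is needed.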

answered Oct 11 '22 08:10

Romain

Tags:

python

pandas

outliers