
Delete rows preceding and following a row containing NaN in Python?

I am trying to clean experimental data using Python with NumPy and pandas. Some of my measurements are implausible. I want to remove these measurements, together with the 2 preceding and 2 following measurements from the same sample. I am looking for an elegant way to do this without a for loop, as my dataframes are quite large.

My data:

>>>df

    Date    Time    Sample  Measurement
index
7737    2019-04-15  06:40:00    A   6.560
7739    2019-04-15  06:50:00    A   1.063
7740    2019-04-15  06:55:00    A   1.136
7741    2019-04-15  07:00:00    A   1.301
7742    2019-04-15  07:05:00    A   1.435
7743    2019-04-15  07:10:00    A   1.704
7744    2019-04-15  07:15:00    A   1.961
7745    2019-04-15  07:20:00    A   2.023
7746    2019-04-15  07:25:00    A   6.284
7747    2019-04-15  07:30:00    A   2.253
7748    2019-04-15  07:35:00    A   6.549
7749    2019-04-15  07:40:00    A   2.591
7750    2019-04-15  07:45:00    A   6.321
7752    2019-04-15  07:55:00    A   0.937
7753    2019-04-15  08:00:00    B   0.372
7754    2019-04-15  08:05:00    B   0.382
7755    2019-04-15  08:10:00    B   0.390
7756    2019-04-15  08:15:00    B   0.455
7757    2019-04-15  08:20:00    B   6.499


import numpy as np
import pandas as pd 

df['Measurement'] = np.where(df['Measurement']>6.0, np.nan, df['Measurement'])

gives

>>>df

    Date    Time    Sample  Measurement
index
7737    2019-04-15  06:40:00    A   NaN
7739    2019-04-15  06:50:00    A   1.063
7740    2019-04-15  06:55:00    A   1.136
7741    2019-04-15  07:00:00    A   1.301
7742    2019-04-15  07:05:00    A   1.435
7743    2019-04-15  07:10:00    A   1.704
7744    2019-04-15  07:15:00    A   1.961
7745    2019-04-15  07:20:00    A   2.023
7746    2019-04-15  07:25:00    A   NaN
7747    2019-04-15  07:30:00    A   2.253
7748    2019-04-15  07:35:00    A   NaN
7749    2019-04-15  07:40:00    A   2.591
7750    2019-04-15  07:45:00    A   NaN
7752    2019-04-15  07:55:00    A   0.937
7753    2019-04-15  08:00:00    B   0.372
7754    2019-04-15  08:05:00    B   0.382
7755    2019-04-15  08:10:00    B   0.390
7756    2019-04-15  08:15:00    B   0.455
7757    2019-04-15  08:20:00    B   NaN

I then deleted the NaN rows using

df = df[np.isfinite(df['Measurement'])]

This is the result I am trying to obtain after removing the 2 rows preceding and the 2 rows following each row containing NaN within a sample (note that 7753 has to stay, as this measurement belongs to sample B):


    Date    Time    Sample  Measurement
index
7741    2019-04-15  07:00:00    A   1.301
7742    2019-04-15  07:05:00    A   1.435
7743    2019-04-15  07:10:00    A   1.704
7753    2019-04-15  08:00:00    B   0.372
7754    2019-04-15  08:05:00    B   0.382


drosophilately asked May 24 '19 12:05


3 Answers

We can mark all indices that are up to two places before or after a NaN, then replace their values with NaN as well:

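# Get the indices of the NaNs
# (this assumes a default 0-based integer index, as in the output below)
idxnull = df[df['Measurement'].isnull()].index

# range() excludes its upper bound, so range(x-2, x+2) covers x-2 .. x+1
a = [range(x+2) if x==0 else range(x-2, x) if x==idxnull.max() else range(x-2, x+2) for x in idxnull]

for rng in a:
    df.loc[rng, 'Measurement'] = np.nan

df.dropna(inplace=True)
# range(x+2) only reaches x+1 for the leading NaN, so drop the one row left after it
df = df.iloc[1:]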

    Index        Date      Time Sample  Measurement
3    7741  2019-04-15  07:00:00      A        1.301
4    7742  2019-04-15  07:05:00      A        1.435
5    7743  2019-04-15  07:10:00      A        1.704
14   7753  2019-04-15  08:00:00      B        0.372
15   7754  2019-04-15  08:05:00      B        0.382

The list comprehension looks quite difficult, but it is equivalent to the following:

for x in idxnull:
    if x == 0:
        range(x+2)
    elif x == idxnull.max():
        range(x-2, x)
    else:
        range(x-2, x+2)
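
The special cases for the first and last row can also be avoided by building one symmetric window per NaN and clipping it to the valid row positions. The following is only a minimal sketch under assumptions: `window_positions` is a hypothetical helper name, it works on 0-based row positions rather than index labels, and it ignores sample boundaries, so it would also drop position 14 (index 7753) for the data in the question.

```python
import numpy as np

def window_positions(bad_positions, n_rows, k=2):
    """Return the sorted row positions within k rows of any bad position."""
    bad = np.asarray(bad_positions, dtype=int)
    offsets = np.arange(-k, k + 1)              # -k .. +k, inclusive
    # Every bad position combined with every offset, flattened to 1-D
    candidates = (bad[:, None] + offsets[None, :]).ravel()
    # Discard positions that fall before the first or after the last row
    in_bounds = candidates[(candidates >= 0) & (candidates < n_rows)]
    return np.unique(in_bounds)

# Positions of the NaNs in the question's data (19 rows in total)
drop_pos = window_positions([0, 8, 10, 12, 18], n_rows=19)
```

The resulting positions could then be removed with something like `df.drop(df.index[drop_pos])`.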

Erfan answered Oct 18 '22 09:10


First I mark the invalid rows as you did and leave every other row as NaN, then forward-fill and back-fill the mark by up to 2 rows within each sample:

df['invalid'] = np.where(df.Measurement.gt(6), True, np.nan)
groups = df.groupby('Sample')

df['invalid'] = groups.invalid.ffill(limit=2)
df['invalid'] = groups.invalid.bfill(limit=2)

# drop the invalids and the helper column in one step
# (avoids calling an inplace drop on a filtered copy):
df = df[df.invalid.isna()].drop(columns='invalid')

Output:

        Date        Time    Sample  Measurement
Index               
7741    2019-04-15  07:00:00    A   1.301
7742    2019-04-15  07:05:00    A   1.435
7743    2019-04-15  07:10:00    A   1.704
7753    2019-04-15  08:00:00    B   0.372
7754    2019-04-15  08:05:00    B   0.382
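
The same per-sample idea can also be written with shifted boolean masks instead of a helper column. This is a minimal sketch, not a definitive implementation: `drop_around_nan` and `demo` are hypothetical names, the Date and Time columns are left out for brevity, and the only loop is over the offsets 1..k, not over rows.

```python
import pandas as pd

def drop_around_nan(df, value_col="Measurement", group_col="Sample", k=2):
    """Drop NaN rows plus the k rows before and after them within each group."""
    bad = df[value_col].isna()
    near_bad = bad.copy()
    grouped = bad.groupby(df[group_col])
    for offset in range(1, k + 1):
        # Shifting inside each group never leaks a flag into another sample
        near_bad |= grouped.shift(offset, fill_value=False).astype(bool)
        near_bad |= grouped.shift(-offset, fill_value=False).astype(bool)
    return df[~near_bad]

# The question's data, reduced to the two columns the rule needs
demo = pd.DataFrame(
    {
        "Sample": ["A"] * 14 + ["B"] * 5,
        "Measurement": [6.560, 1.063, 1.136, 1.301, 1.435, 1.704, 1.961,
                        2.023, 6.284, 2.253, 6.549, 2.591, 6.321, 0.937,
                        0.372, 0.382, 0.390, 0.455, 6.499],
    },
    index=[7737, 7739, 7740, 7741, 7742, 7743, 7744, 7745, 7746, 7747,
           7748, 7749, 7750, 7752, 7753, 7754, 7755, 7756, 7757],
)
# Replace implausible readings (> 6.0) with NaN, as in the question
demo["Measurement"] = demo["Measurement"].where(demo["Measurement"] <= 6.0)

cleaned = drop_around_nan(demo)  # keeps 7741, 7742, 7743, 7753, 7754
```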

Quang Hoang answered Oct 18 '22 09:10


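# Flag implausible readings in sample A only, so that B readings don't get flagged
df.loc[(df['Measurement'] > 6) & (df['Sample'] == 'A'), 'drop'] = 'Y'

# Use row positions rather than index labels, because the labels have gaps (e.g. no 7738)
pos = np.flatnonzero((df['drop'] == 'Y').to_numpy())
l_drop = set()
for i in pos:
    # the flagged row plus the 2 rows before and the 2 rows after it
    l_drop.update(range(i - 2, i + 3))

# Keep only positions that exist and belong to sample A, so B readings don't get dropped
l_drop = [p for p in l_drop if 0 <= p < len(df) and df['Sample'].iloc[p] == 'A']

df.drop(df.index[l_drop], inplace=True)

This loops over the flagged rows only, not over the whole dataframe. As written, it handles sample A only.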
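
For a variant with no Python-level loop over rows at all, a centered rolling window per sample is one option. This is a sketch under assumptions: `near_nan_mask` and the `toy` frame are hypothetical, made up for illustration, and the rows of each sample are assumed to be stored in measurement order.

```python
import numpy as np
import pandas as pd

def near_nan_mask(values, groups, k=2):
    """True for rows that are NaN or within k rows of a NaN in the same group."""
    bad = values.isna().astype(float)           # 1.0 marks a NaN row
    window = 2 * k + 1                          # k rows on each side plus the row itself
    hits = bad.groupby(groups).transform(
        # A centered rolling max is 1.0 if any row in the window is bad
        lambda s: s.rolling(window, center=True, min_periods=1).max()
    )
    return hits.astype(bool)

# Small made-up example: six readings for sample A, four for sample B
toy = pd.DataFrame({
    "Sample": list("AAAAAABBBB"),
    "Measurement": [1.0, np.nan, 1.2, 1.3, 1.4, 1.5, 0.3, 0.4, 0.5, np.nan],
})
kept = toy[~near_nan_mask(toy["Measurement"], toy["Sample"])]
```

Because the window is applied inside each group, the NaN in the last row of sample B does not affect the last rows of sample A, and vice versa.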

Sid answered Oct 18 '22 08:10