
Pandas: how to filter a dataframe by certain ranges of numbers in a column

I am trying to come up with a way to filter a dataframe so that it contains only certain ranges of numbers needed for further processing. Below is an example dataframe:

import pandas as pd

data_sample = [['part1', 234], ['part2', 224], ['part3', 214], ['part4', 114], ['part5', 1111],
               ['part6', 1067], ['part7', 1034], ['part8', 1457], ['part9', 789], ['part10', 1367],
               ['part11', 467], ['part12', 367]]
data_df = pd.DataFrame(data_sample, columns=['partname', 'sbin'])
data_df['sbin'] = pd.to_numeric(data_df['sbin'], errors='coerce', downcast='integer')

With the above dataframe I want to filter such that any part with sbin in the ranges [200-230], [1000-1150], [350-370] and [100-130] is removed.

I have a bigger dataframe with a lot more ranges to be removed, and hence need a faster way than the command below:

data_df.loc[~(((data_df.sbin >= 200) & (data_df.sbin <= 230))
              | ((data_df.sbin >= 100) & (data_df.sbin <= 130))
              | ((data_df.sbin >= 350) & (data_df.sbin <= 370))
              | ((data_df.sbin >= 1000) & (data_df.sbin <= 1150)))]

which produces the output below:

    partname    sbin
0   part1       234
7   part8       1457
8   part9       789
9   part10      1367
10  part11      467

The above method requires a lot of conditions and takes a long time. I would like to know if there is a better way, using regex or some other Python approach that I am not aware of.

Any help would be great.

asked Dec 30 '22 by Nandeep Devendra

2 Answers

pd.cut works fine here, especially as your intervals are not overlapping:

intervals = pd.IntervalIndex.from_tuples([(200, 230), (1000, 1150), (350, 370), (100, 130)])

# if a value does not fall within any of the intervals, pd.cut returns NaN,
# hence the isna check to keep only the rows that match no interval
# thanks to @corralien for the include_lowest=True suggestion

data_df.loc[pd.cut(data_df.sbin, intervals, include_lowest=True).isna()]

   partname  sbin
0     part1   234
7     part8  1457
8     part9   789
9    part10  1367
10   part11   467
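
A side note, not part of the original answer: pd.IntervalIndex.from_tuples builds right-closed intervals by default, so a value sitting exactly on a lower bound (e.g. sbin == 200) may not be matched even with include_lowest=True. If the bounds should be inclusive on both ends, one option is to build the index with closed='both'. A minimal sketch, assuming the sample data_df from the question:

import pandas as pd

# intervals closed on both ends, i.e. [100, 130], [200, 230], [350, 370], [1000, 1150]
intervals = pd.IntervalIndex.from_tuples(
    [(200, 230), (1000, 1150), (350, 370), (100, 130)], closed='both')

# rows whose sbin falls in none of the intervals come back as NaN from pd.cut
out = data_df.loc[pd.cut(data_df.sbin, intervals).isna()]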

answered Jan 04 '23 by sammywemmy

New version

Use np.logical_and and np.any to select the values that fall in the ranges, then invert the mask to keep the other ones.

import numpy as np

intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
m = np.any([np.logical_and(data_df['sbin'] >= l, data_df['sbin'] <= u)
            for l, u in intervals], axis=0)
out = data_df.loc[~m]

Note that np.any can be replaced by np.logical_or.reduce:

intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
m = np.logical_or.reduce([np.logical_and(data_df['sbin'] >= l, data_df['sbin'] <= u)
                          for l, u in intervals])
out = data_df.loc[~m]

Output result:

>>> out
   partname  sbin
0     part1   234
7     part8  1457
8     part9   789
9    part10  1367
10   part11   467
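
As an aside, not part of the original answer: the same mask can arguably be written a bit more readably with Series.between, which is inclusive on both ends by default. A sketch equivalent to the version above:

import numpy as np

intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
# Series.between(l, u) is the same as (data_df['sbin'] >= l) & (data_df['sbin'] <= u)
m = np.logical_or.reduce([data_df['sbin'].between(l, u) for l, u in intervals])
out = data_df.loc[~m]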

Old version

This does not work with float numbers as is.

Use np.in1d on an array of all the integer values covered by the intervals:

intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
m = np.hstack([np.arange(l, u+1) for l, u in intervals])
out = data_df.loc[~np.in1d(data_df['sbin'], m)]

Performance comparison, for 100k records:

data_df = pd.DataFrame({'sbin': np.random.randint(0, 2000, 100000)})

def exclude_range_danimesejo():
    intervals = sorted([(200, 230), (1000, 1150), (350, 370), (100, 130)])
    # flatten the sorted bounds: values strictly inside a range get an odd
    # insertion index from searchsorted, values outside get an even one;
    # lower bounds also land on an even index, so they are removed separately
    # with the in1d check against the lower bounds (intervals[::2])
    intervals = np.array(intervals).flatten()
    mask = (np.searchsorted(intervals, data_df['sbin']) % 2 == 0) & ~np.in1d(data_df['sbin'], intervals[::2])
    return data_df.loc[mask]

def exclude_range_sammywemmy():
    intervals = pd.IntervalIndex.from_tuples([(200, 230), (1000, 1150), (350, 370), (100, 130)])
    return data_df.loc[pd.cut(data_df.sbin, intervals, include_lowest=True).isna()]

def exclude_range_corralien():
    intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
    m = np.hstack([np.arange(l, u+1) for l, u in intervals])
    return data_df.loc[~np.in1d(data_df['sbin'], m)]

def exclude_range_corralien2():
    intervals = [(100, 130), (200, 230), (350, 370), (1000, 1150)]
    m = np.any([np.logical_and(data_df['sbin'] >= l, data_df['sbin'] <= u)
                                        for l, u in intervals], axis=0)
    return data_df.loc[~m]

>>> %timeit exclude_range_danimesejo()
2.66 ms ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit exclude_range_sammywemmy()
63.6 ms ± 549 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit exclude_range_corralien()
6.87 ms ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit exclude_range_corralien2()
2.26 ms ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

answered Jan 04 '23 by Corralien