 

Filter numeric column by dictionary of value ranges

Tags: python, pandas

I have a dataframe and a dictionary:

df = 

VARIABLE VALUE
A        3
A        4
A        60
A        5
B        1
B        2
B        3
B        100
C        0
C        1
# inclusive
accepted_ranges = {
    'A': [3, 5],
    'B': [1, 3],
}

And I would like to clean the VALUE column according to the accepted ranges in the dictionary.

df = 

VARIABLE VALUE
A        3
A        4
A        NaN
A        5
B        1
B        2
B        3
B        NaN
C        0
C        1

I have tried using map(), but I can't seem to find a way to use it per VARIABLE group. apply() would work, I think, but it is really slow on my dataframe (>10M rows). Thank you beforehand.
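For reference, the apply-based approach I mean looks roughly like this (a minimal sketch built from the example data above; clean_value is just an illustrative helper name):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VARIABLE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'VALUE':    [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})
accepted_ranges = {'A': [3, 5], 'B': [1, 3]}  # inclusive bounds

def clean_value(row):
    # variables without an entry in the dictionary are left untouched
    if row['VARIABLE'] not in accepted_ranges:
        return row['VALUE']
    lo, hi = accepted_ranges[row['VARIABLE']]
    return row['VALUE'] if lo <= row['VALUE'] <= hi else np.nan

# row-wise apply: gives the desired result, but is slow on large frames
df['VALUE'] = df.apply(clean_value, axis=1)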

Asked Jan 16 '20 by Pab
1 Answer

Use Series.map to map your dictionary onto the VARIABLE column. Then use Series.between to check whether each VALUE falls within its range.

Finally, use Series.where to convert the False values to NaN:

ranges = df['VARIABLE'].map(accepted_ranges)
df['VALUE'] = df['VALUE'].where(df['VALUE'].between(ranges.str[0], ranges.str[1]))

  VARIABLE  VALUE
0        A    3.0
1        A    4.0
2        A    NaN
3        A    5.0
4        B    1.0
5        B    2.0
6        B    3.0
7        B    NaN
8        C    0.0
9        C    1.0

More speed!

The .str accessor can be quite slow, since it is mostly a "loopy" implementation under the hood. With ~10M rows in your data, that makes the code noticeably less efficient. We can avoid it by splitting accepted_ranges into two dictionaries and building two bound vectors with Series.map. The extra mask m2 keeps rows whose VARIABLE has no entry in the dictionary (like C) untouched:

# lower and upper bounds as separate dictionaries
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

# map each row's VARIABLE to its lower and upper bound
ranges1 = df['VARIABLE'].map(accepted_ranges1)
ranges2 = df['VARIABLE'].map(accepted_ranges2)

# m1: VALUE falls inside its accepted range
# m2: VARIABLE has no range defined, so the row is kept as-is
m1 = df['VALUE'].between(ranges1, ranges2)
m2 = ~df['VARIABLE'].isin(list(accepted_ranges.keys()))

df['VALUE'] = df['VALUE'].where(m1 | m2)
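As a quick check (a sketch reusing the small example from the question; it assumes df still holds the original values rather than the output of the first snippet), the result matches the expected output, with C left untouched:

import numpy as np
import pandas as pd

expected = pd.Series([3, 4, np.nan, 5, 1, 2, 3, np.nan, 0, 1], name='VALUE')
pd.testing.assert_series_equal(df['VALUE'], expected)  # passes: NaN only where VALUE was out of range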

Speed comparison

# create example dataframe of 10m rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape

# (10000000, 2)
# Erfan 1
%%timeit
ranges = dfbig['VARIABLE'].map(accepted_ranges)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges.str[0], ranges.str[1]))

10 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Erfan 2
%%timeit
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

ranges1 = dfbig['VARIABLE'].map(accepted_ranges1)
ranges2 = dfbig['VARIABLE'].map(accepted_ranges2)

dfbig['VALUE'].where(dfbig['VALUE'].between(ranges1, ranges2))

1.03 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# piRSquared
%%timeit
mask = [
    # rows whose VARIABLE has no accepted range are kept
    # (the guard also avoids a KeyError for e.g. C)
    k not in accepted_ranges
    or accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
    for k, v in zip(dfbig.VARIABLE, dfbig.VALUE)
]

dfbig.VALUE.where(mask)

3.11 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Sep 20 '22 by Erfan