 

Filter numeric column by dictionary of value ranges

Tags: python, pandas

I have a dataframe and a dictionary:

df = 

VARIABLE VALUE
A        3
A        4
A        60
A        5
B        1
B        2
B        3
B        100
C        0
C        1
# inclusive
accepted_ranges = {
    'A': [3, 5],
    'B': [1, 3],
}

And I would like to clean the VALUE column according to the accepted ranges in the dictionary.

df = 

VARIABLE VALUE
A        3
A        4
A        NaN
A        5
B        1
B        2
B        3
B        NaN
C        0
C        1

I have tried using map(), but I can't seem to find a way to use it per VARIABLE group. apply() would work, I think, but it is really slow on my dataframe (>10M rows). Thank you beforehand.
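For reference, the apply-based approach I mean looks roughly like this (a minimal sketch built from the example data above; clean_value is just an illustrative helper name):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'VARIABLE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'VALUE':    [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})
accepted_ranges = {'A': [3, 5], 'B': [1, 3]}  # inclusive bounds

def clean_value(row):
    # variables without an entry in the dictionary are left untouched
    if row['VARIABLE'] not in accepted_ranges:
        return row['VALUE']
    lo, hi = accepted_ranges[row['VARIABLE']]
    return row['VALUE'] if lo <= row['VALUE'] <= hi else np.nan

# row-wise apply: gives the desired result, but is slow on large frames
df['VALUE'] = df.apply(clean_value, axis=1)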

Asked Jan 16 '20 by Pab
1 Answer

Use Series.map to map your dictionary onto the VARIABLE column. Then use Series.between to check whether each VALUE falls within its range.

Finally, use Series.where to convert the False values to NaN:

ranges = df['VARIABLE'].map(accepted_ranges)
df['VALUE'] = df['VALUE'].where(df['VALUE'].between(ranges.str[0], ranges.str[1]))

  VARIABLE  VALUE
0        A    3.0
1        A    4.0
2        A    NaN
3        A    5.0
4        B    1.0
5        B    2.0
6        B    3.0
7        B    NaN
8        C    0.0
9        C    1.0

More speed!

The .str accessor can be quite slow, since it is mostly a "loopy" implementation under the hood. With ~10M rows in your data, that makes the code noticeably less efficient. We can avoid it by splitting accepted_ranges into two dictionaries and building two bound vectors with Series.map. The extra mask m2 keeps rows whose VARIABLE has no entry in the dictionary (like C) untouched:

# lower and upper bounds as separate dictionaries
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

# map each row's VARIABLE to its lower and upper bound
ranges1 = df['VARIABLE'].map(accepted_ranges1)
ranges2 = df['VARIABLE'].map(accepted_ranges2)

# m1: VALUE falls inside its accepted range
# m2: VARIABLE has no range defined, so the row is kept as-is
m1 = df['VALUE'].between(ranges1, ranges2)
m2 = ~df['VARIABLE'].isin(list(accepted_ranges.keys()))

df['VALUE'] = df['VALUE'].where(m1 | m2)
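As a quick check (a sketch reusing the small example from the question; it assumes df still holds the original values rather than the output of the first snippet), the result matches the expected output, with C left untouched:

import numpy as np
import pandas as pd

expected = pd.Series([3, 4, np.nan, 5, 1, 2, 3, np.nan, 0, 1], name='VALUE')
pd.testing.assert_series_equal(df['VALUE'], expected)  # passes: NaN only where VALUE was out of range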

Speed comparison

# create example dataframe of 10m rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape

# (10000000, 2)
# Erfan 1
%%timeit
ranges = dfbig['VARIABLE'].map(accepted_ranges)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges.str[0], ranges.str[1]))

10 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Erfan 2
%%timeit
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

ranges1 = dfbig['VARIABLE'].map(accepted_ranges1)
ranges2 = dfbig['VARIABLE'].map(accepted_ranges2)

dfbig['VALUE'].where(dfbig['VALUE'].between(ranges1, ranges2))

1.03 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# piRSquared
%%timeit
mask = [
    # rows whose VARIABLE has no accepted range are kept
    # (the guard also avoids a KeyError for e.g. C)
    k not in accepted_ranges
    or accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
    for k, v in zip(dfbig.VARIABLE, dfbig.VALUE)
]

dfbig.VALUE.where(mask)

3.11 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Sep 20 '22 by Erfan