I have a dataframe and a dictionary:
df =
VARIABLE VALUE
A 3
A 4
A 60
A 5
B 1
B 2
B 3
B 100
C 0
C 1
# inclusive
accepted_ranges={
A:[3,5],
B:[1,3]
}
And I would like to clean the VALUE column according to the accepted ranges in the dictionary.
df =
VARIABLE VALUE
A 3
A 4
A NaN
A 5
B 1
B 2
B 3
B NaN
C 0
C 1
I have tried: using map()
but I can't seem to find a way to use it by VARIABLE groups. apply()
would work, I think, however, apply()
works really slow with my dataframe (>10M rows). Thank you beforehand.
Use Series.map
to map your dictionary to each VARIABLE
. Then we use Series.between
to check if each VALUE
is between the range.
Finally we use Series.where
to convert the False
values to NaN
ranges = df['VARIABLE'].map(accepted_ranges)
df['VALUE'] = df['VALUE'].where(df['VALUE'].between(ranges.str[0], ranges.str[1]))
VARIABLE VALUE
0 A 3.0
1 A 4.0
2 A NaN
3 A 5.0
4 B 1.0
5 B 2.0
6 B 3.0
7 B NaN
8 C 0.0
9 C 1.0
The .str
accessor can be quite slow and are mostly "loopy" implementations on the background. Especially since you have ~10m rows in your data, this can cause for less efficient code. We can solve this by splitting your accepted_ranges
in two dictionarys and thus create two vectors with Series.map
:
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}
ranges1 = df['VARIABLE'].map(accepted_ranges1)
ranges2 = df['VARIABLE'].map(accepted_ranges2)
m1 = df['VALUE'].between(ranges1, ranges2)
m2 = ~df['VARIABLE'].isin(list(accepted_ranges.keys()))
df['VALUE'] = df['VALUE'].where(m1|m2)
# create example dataframe of 10m rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape
# (10000000, 2)
# Erfan 1
%%timeit
ranges = dfbig['VARIABLE'].map(accepted_ranges)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges.str[0], ranges.str[1]))
10 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Erfan 2
%%timeit
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}
ranges1 = dfbig['VARIABLE'].map(accepted_ranges1)
ranges2 = dfbig['VARIABLE'].map(accepted_ranges2)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges1, ranges2))
1.03 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# piRSquared
%%timeit
mask = [
accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
for k, v in zip(dfbig.VARIABLE, dfbig.VALUE)
]
dfbig.VALUE.where(mask)
3.11 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With