I have a dataframe and a dictionary:
df =
VARIABLE VALUE
A 3
A 4
A 60
A 5
B 1
B 2
B 3
B 100
C 0
C 1
# inclusive
accepted_ranges={
A:[3,5],
B:[1,3]
}
And I would like to clean the VALUE column according to the accepted ranges in the dictionary.
df =
VARIABLE VALUE
A 3
A 4
A NaN
A 5
B 1
B 2
B 3
B NaN
C 0
C 1
I have tried: using map() but I can't seem to find a way to use it by VARIABLE groups. apply() would work, I think, however, apply() works really slow with my dataframe (>10M rows). Thank you beforehand.
Use Series.map to map your dictionary to each VARIABLE. Then we use Series.between to check if each VALUE is between the range.
Finally we use Series.where to convert the False values to NaN
ranges = df['VARIABLE'].map(accepted_ranges)
df['VALUE'] = df['VALUE'].where(df['VALUE'].between(ranges.str[0], ranges.str[1]))
VARIABLE VALUE
0 A 3.0
1 A 4.0
2 A NaN
3 A 5.0
4 B 1.0
5 B 2.0
6 B 3.0
7 B NaN
8 C 0.0
9 C 1.0
The .str accessor can be quite slow and are mostly "loopy" implementations on the background. Especially since you have ~10m rows in your data, this can cause for less efficient code. We can solve this by splitting your accepted_ranges in two dictionarys and thus create two vectors with Series.map:
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}
ranges1 = df['VARIABLE'].map(accepted_ranges1)
ranges2 = df['VARIABLE'].map(accepted_ranges2)
m1 = df['VALUE'].between(ranges1, ranges2)
m2 = ~df['VARIABLE'].isin(list(accepted_ranges.keys()))
df['VALUE'] = df['VALUE'].where(m1|m2)
# create example dataframe of 10m rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape
# (10000000, 2)
# Erfan 1
%%timeit
ranges = dfbig['VARIABLE'].map(accepted_ranges)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges.str[0], ranges.str[1]))
10 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Erfan 2
%%timeit
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}
ranges1 = dfbig['VARIABLE'].map(accepted_ranges1)
ranges2 = dfbig['VARIABLE'].map(accepted_ranges2)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges1, ranges2))
1.03 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# piRSquared
%%timeit
mask = [
accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
for k, v in zip(dfbig.VARIABLE, dfbig.VALUE)
]
dfbig.VALUE.where(mask)
3.11 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With