I have a DataFrame and I want to fill a new column based on a lookup table. I can't use map
since the lookup table is keyed on several index levels.
import pandas as pd
import numpy as np
d = pd.DataFrame({'I': np.random.randint(3, size=5),
'B0': np.random.choice([True, False], 5),
'B1': np.random.choice([True, False], 5)})
which is my data (actually my data are much bigger):
B0 B1 I
0 True False 0
1 False False 0
2 False False 1
3 True False 1
4 False True 2
then my lookup table:
l = pd.DataFrame({(True, True): [1.1, 2.2, 3.3],
(True, False): [1.3, 2.1, 3.1],
(False, True): [1.2, 2.1, 3.1],
(False, False): [1.1, 2.0, 5.1]}
)
l.index.name = 'I'
l.columns.names = 'B0', 'B1'
l = l.stack(['B0', 'B1'])
which is
I B0 B1
0 False False 1.1
True 1.2
True False 1.3
True 1.1
1 False False 2.0
True 2.1
True False 2.1
True 2.2
2 False False 5.1
True 3.1
True False 3.1
True 3.3
so I want to add a column w to my data by querying the lookup table on the values (I, B0, B1). I am using apply:
d['w'] = d.apply(lambda x: l[x['I'], x['B0'], x['B1']], axis=1)
and it works:
B0 B1 I w
0 True False 0 1.3
1 False False 0 1.1
2 False False 1 2.0
3 True False 1 2.1
4 False True 2 3.1
The problem is that it is terribly slow. How can I speed this up?
apply is optimized to loop through DataFrame rows much more quickly than iterrows does. itertuples, while slower than apply, is quicker than iterrows, so if looping is required, try itertuples instead. Using map as a vectorized solution gives even faster results.
You can speed up execution further with another trick: make your pandas DataFrames lighter by using more efficient data types. Since we know that df only contains integers from 1 to 10, we can reduce the data type from 64 bits to 16 bits. This reduces the size of the DataFrame from 38MB to 9.5MB.
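The downcasting idea can be sketched as follows. The frame df, its column name a, and the row count are made up for illustration; only the 1-to-10 value range and the change from 64 to 16 bits come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column of integers between 1 and 10 (int64 by default).
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.integers(1, 11, size=100_000)})

before = df.memory_usage(deep=True).sum()

# int16 holds values from -32768 to 32767, so 1..10 fits without loss.
small = df.astype({'a': 'int16'})
after = small.memory_usage(deep=True).sum()

print(f'before: {before} bytes, after: {after} bytes')
```

The exact byte counts depend on the number of rows, so compare the two printed values rather than expecting the figures quoted above.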
By using apply and specifying 1 as the axis, we can run a function on every row of a DataFrame. This solution also loops to get the job done, but apply is better optimized than iterrows, which results in faster runtimes.
applymap() is only available on DataFrame and is used for element-wise operations across the whole DataFrame (since pandas 2.1 it is deprecated in favor of DataFrame.map). It has been optimized and in some cases works much faster than apply(), but it's worth comparing it with apply() before using it for any heavier operation.
This should be quicker
find_these = list(zip(d.I, d.B0, d.B1))
d.assign(w=l.loc[find_these].values)
B0 B1 I w
0 True False 0 1.3
1 False False 0 1.1
2 False False 1 2.0
3 True False 1 2.1
4 False True 2 3.1
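If some (I, B0, B1) combinations might be missing from the lookup table, a reindex-based variant may be worth a look. This is an alternative sketch, not part of the answer above: it rebuilds d and l from the fixed values printed in the question so that it runs on its own, and it assumes the lookup index has no duplicate keys.

```python
import pandas as pd

# The data frame from the question, with fixed values instead of random ones.
d = pd.DataFrame({'I': [0, 0, 1, 1, 2],
                  'B0': [True, False, False, True, False],
                  'B1': [False, False, False, False, True]})

# The stacked lookup table from the question, built directly as a Series.
idx = pd.MultiIndex.from_product([[0, 1, 2], [False, True], [False, True]],
                                 names=['I', 'B0', 'B1'])
l = pd.Series([1.1, 1.2, 1.3, 1.1, 2.0, 2.1, 2.1, 2.2, 5.1, 3.1, 3.1, 3.3],
              index=idx)

# One key per row of d; reindex returns NaN for keys absent from l.
keys = pd.MultiIndex.from_arrays([d['I'], d['B0'], d['B1']])
result = d.assign(w=l.reindex(keys).to_numpy())
print(result)
```

Like the .loc version, this does a single vectorized lookup instead of one lookup per row.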
With join
d.join(l.rename('w'), on=['I', 'B0', 'B1'])
B0 B1 I w
0 True False 0 1.3
1 False False 0 1.1
2 False False 1 2.0
3 True False 1 2.1
4 False True 2 3.1
Timing
small data
%%timeit
find_these = list(zip(d.I, d.B0, d.B1))
d.assign(w=l.loc[find_these].values)
100 loops, best of 3: 1.98 ms per loop
%timeit d.assign(w=d.apply(lambda x: l[x['I'], x['B0'], x['B1']], axis=1))
100 loops, best of 3: 11.8 ms per loop
%timeit d.join(l.rename('w'), on=['I', 'B0', 'B1'])
100 loops, best of 3: 1.99 ms per loop
%timeit d.merge(l.reset_index())
100 loops, best of 3: 2.89 ms per loop
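To repeat the comparison outside IPython, something like the following script could be used. It is only a sketch: the row count n, the random seed, and the helper names with_apply, with_loc and with_join are made up here, and the numbers it prints will differ from the ones above depending on machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000  # hypothetical size; increase it to see the gap grow
d = pd.DataFrame({'I': rng.integers(3, size=n),
                  'B0': rng.choice([True, False], n),
                  'B1': rng.choice([True, False], n)})

# Same lookup table as in the question, built directly as a Series.
idx = pd.MultiIndex.from_product([[0, 1, 2], [False, True], [False, True]],
                                 names=['I', 'B0', 'B1'])
l = pd.Series([1.1, 1.2, 1.3, 1.1, 2.0, 2.1, 2.1, 2.2, 5.1, 3.1, 3.1, 3.3],
              index=idx)


def with_apply():
    # Row-by-row lookup, as in the question.
    return d.apply(lambda x: l[x['I'], x['B0'], x['B1']], axis=1)


def with_loc():
    # One lookup with a list of key tuples.
    return l.loc[list(zip(d.I, d.B0, d.B1))].values


def with_join():
    # Join the named lookup Series on the three key columns.
    return d.join(l.rename('w'), on=['I', 'B0', 'B1'])['w']


for func in (with_apply, with_loc, with_join):
    seconds = timeit.timeit(func, number=3) / 3
    print(f'{func.__name__}: {seconds * 1e3:.2f} ms per call')
```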
We can merge d with a flattened l (after applying reset_index()):
In [5]: d.merge(l.reset_index())
Out[5]:
B0 B1 I 0
0 True False 0 1.3
1 True False 0 1.3
2 False True 0 1.2
3 False False 0 1.1
4 False True 2 3.1
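A plain merge uses an inner join by default and, as the output shows, leaves the value column named 0. A minimal sketch of a variant that names the column w and keeps rows of d that have no match in the lookup table (their w becomes NaN) is given here; it uses the fixed values printed in the question, and the names flat and merged are made up for the example.

```python
import pandas as pd

# The data frame from the question, with fixed values instead of random ones.
d = pd.DataFrame({'I': [0, 0, 1, 1, 2],
                  'B0': [True, False, False, True, False],
                  'B1': [False, False, False, False, True]})

# The stacked lookup table from the question, built directly as a Series.
idx = pd.MultiIndex.from_product([[0, 1, 2], [False, True], [False, True]],
                                 names=['I', 'B0', 'B1'])
l = pd.Series([1.1, 1.2, 1.3, 1.1, 2.0, 2.1, 2.1, 2.2, 5.1, 3.1, 3.1, 3.3],
              index=idx)

# Name the values 'w' before flattening, so the merged column is not called 0.
flat = l.rename('w').reset_index()

# how='left' keeps every row of d, even when its key is missing from flat.
merged = d.merge(flat, on=['I', 'B0', 'B1'], how='left')
print(merged)
```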