I have a list of numbers. I want to create a boolean mask of this list (or array, it doesn't matter which) for every unique element in it.
In the example below, I want to create four masks of length len(labels). The first mask is True at position i if labels[i] == 0, the second is True at position i if labels[i] == 1, and so on.
I tried to do this with pandas and the .isin method in a loop. However, this is too slow for my purposes: it is called many times in my algorithm, and the list of labels can be very long, so the loop is inefficient. How can I make this faster?
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
d = dict()
y = pd.Series(labels)
# Build one boolean mask per unique label
for i in set(labels):
    d[i] = y.isin([i])
Method 1
Using list and set
In [989]: {x: [x==l for l in labels] for x in set(labels)}
Out[989]:
{0: [True, True, False, False, False, False, False, False, False, False, True, True],
1: [False, False, True, True, False, False, False, True, False, True, False, False],
2: [False, False, False, False, False, False, False, False, True, False, False, False],
3: [False, False, False, False, True, True, True, False, False, False, False, False]}
If you want it as a DataFrame:
In [994]: pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
Out[994]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False
Method 2
Using pd.get_dummies, if you already have a Series anyway:
In [997]: pd.get_dummies(y).astype(bool)
Out[997]:
        0      1      2      3
0    True  False  False  False
1    True  False  False  False
2   False   True  False  False
3   False   True  False  False
4   False  False  False   True
5   False  False  False   True
6   False  False  False   True
7   False   True  False  False
8   False  False   True  False
9   False   True  False  False
10   True  False  False  False
11   True  False  False  False
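If you need the result in the same shape as the dictionary d from the question (one mask per label), the columns of the get_dummies frame can be pulled apart again. This is only a minimal sketch outside IPython; the name dummies is an illustrative choice and not part of the original answer.

```python
import pandas as pd

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
y = pd.Series(labels)

# One boolean column per unique label, computed in a single call
dummies = pd.get_dummies(y).astype(bool)

# Rebuild the question's dict: label -> boolean Series of length len(labels)
d = {label: dummies[label] for label in dummies.columns}
```

Each d[label] can then be used directly as a mask, for example y[d[3]] selects the entries equal to 3.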
Benchmarks
Small
In [1002]: len(labels)
Out[1002]: 12
In [1003]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 476 µs per loop
In [1004]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
1000 loops, best of 3: 580 µs per loop
In [1005]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.15 ms per loop
Large
In [1011]: len(labels)
Out[1011]: 12000
In [1012]: %timeit pd.get_dummies(y).astype(bool)
1000 loops, best of 3: 875 µs per loop
In [1013]: %timeit pd.DataFrame({x: [x==l for l in labels] for x in set(labels)})
100 loops, best of 3: 4.97 ms per loop
In [1014]: %timeit pd.DataFrame({x : (y == x) for x in y.unique()})
1000 loops, best of 3: 1.32 ms per loop
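If pandas is not required, a plain NumPy comparison with broadcasting may also be worth timing on your own data. This is a different technique from the two methods above, it was not part of the benchmarks, and no speed claim is made for it here. The names arr, uniques, masks and mask_by_label are illustrative.

```python
import numpy as np

labels = [0, 0, 1, 1, 3, 3, 3, 1, 2, 1, 0, 0]
arr = np.asarray(labels)

# Sorted unique labels
uniques = np.unique(arr)

# Compare every unique label against every element in one vectorized step;
# the result has shape (number of unique labels, len(labels))
masks = arr[np.newaxis, :] == uniques[:, np.newaxis]

# Optional: dict view, label -> boolean mask row
mask_by_label = {int(u): masks[k] for k, u in enumerate(uniques)}
```

Note that this materializes the full (unique labels x length) boolean array at once, so memory use grows with the number of distinct labels.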