I have a pandas Series with boolean entries. I would like to get a list of indices where the values are True.
For example, the input pd.Series([True, False, True, True, False, False, False, True]) should yield the output [0, 2, 3, 7].
I can do it with a list comprehension, but is there something cleaner or faster?
To access a single element of a Series, use the index operator [ ] with that element's index label; with the default index, the labels are the integers 0, 1, 2, and so on. To access multiple elements, use a slice.
Boolean indexing selects data from a Series or DataFrame using a boolean vector of the same length: only the entries where the vector is True are kept.
In a pandas Series, the labels are called the index and the data are called the values. To get them separately, use the index and values attributes of the Series object.
To convert a pandas Series to a list, call the tolist() method on the Series you wish to convert.
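A minimal sketch of how the index, the values, and tolist() fit together for the Series from the question. It assumes a pandas version that provides Series.to_numpy() (0.24 or later); the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

labels = s.index        # the labels (default integer index, since none was given)
values = s.to_numpy()   # the underlying data as a NumPy boolean array

# Keep only the labels whose value is True, then convert to a plain Python list.
true_labels = s.index[values].tolist()
print(true_labels)  # [0, 2, 3, 7]
```

The final tolist() call is what turns the pandas Index into the plain list the question asks for.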
Boolean Indexing
>>> s = pd.Series([True, False, True, True, False, False, False, True])
>>> s[s].index
Int64Index([0, 2, 3, 7], dtype='int64')
If you need an np.array object, get the .values
>>> s[s].index.values
array([0, 2, 3, 7])
np.nonzero
>>> np.nonzero(s)
(array([0, 2, 3, 7]),)
np.flatnonzero
>>> np.flatnonzero(s)
array([0, 2, 3, 7])
np.where
>>> np.where(s)[0]
array([0, 2, 3, 7])
np.argwhere
>>> np.argwhere(s).ravel()
array([0, 2, 3, 7])
pd.Series.index
>>> s.index[s]
Int64Index([0, 2, 3, 7], dtype='int64')
filter
>>> [*filter(s.get, s.index)]
[0, 2, 3, 7]
list comprehension
>>> [i for i in s.index if s[i]]
[0, 2, 3, 7]
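One caveat worth spelling out: the approaches above all print the same numbers only because s has the default integer index. The NumPy functions return integer positions, while the index-based approaches return index labels. A small sketch with a hypothetical string-labelled Series (the labels "a", "b", "c" are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a non-default (string) index.
s = pd.Series([True, False, True], index=["a", "b", "c"])

labels = s[s].index.tolist()                       # index labels of the True entries
positions = np.flatnonzero(s.to_numpy()).tolist()  # integer positions of the True entries

print(labels)     # ['a', 'c']
print(positions)  # [0, 2]
```

Which one you want depends on whether you plan to look the entries up with .loc (labels) or .iloc (positions).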
As an addition to rafaelc's answer, here are the corresponding timings (from fastest to slowest) for the following setup:
import numpy as np
import pandas as pd
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
np.where
>>> timeit np.where(s)[0]
12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
np.flatnonzero
>>> timeit np.flatnonzero(s)
18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
pd.Series.index
The gap between this and boolean indexing really surprised me, since boolean indexing is the more commonly used approach.
>>> timeit s.index[s]
82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Boolean Indexing
>>> timeit s[s].index
1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you need an np.array object, get the .values
>>> timeit s[s].index.values
1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you need a slightly easier-to-read version (not in the original answer)
>>> timeit s[s==True].index
1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Series.where (not in the original answer)
>>> timeit s.where(s).dropna().index
2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.where(s == True).dropna().index
2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
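pd.Series.mask (not in the original answer)
>>> timeit s.mask(s).dropna().index
2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.mask(s == True).dropna().index
2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note that mask replaces the entries where the condition is True, so as written these two calls return the indices of the False entries; s.mask(~s) would be needed to get the True ones. The timings are for the calls as written.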
list comprehension
>>> timeit [i for i in s.index if s[i]]
13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
filter
>>> timeit [*filter(s.get, s.index)]
14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.nonzero (did not work out of the box for me)
>>> timeit np.nonzero(s)
ValueError: Length of passed values is 1, index implies 1000.
np.argwhere (did not work out of the box for me)
>>> timeit np.argwhere(s).ravel()
ValueError: Length of passed values is 1, index implies 1000.
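I have not verified what causes the ValueError in the two cases above. One possible workaround, assuming the problem lies in passing the Series itself to NumPy, is to hand NumPy a plain array via Series.to_numpy() (available from pandas 0.24). The sketch below uses a seeded generator instead of the setup above so the result is reproducible; that substitution is mine, not part of the original timings.

```python
import numpy as np
import pandas as pd

# Seeded random boolean Series (assumption: same shape as the timing setup).
rng = np.random.default_rng(0)
s = pd.Series(rng.random(1000) > 0.5)

arr = s.to_numpy()  # plain boolean ndarray, no pandas object involved

idx_nonzero = np.nonzero(arr)[0]          # positions of the True entries
idx_argwhere = np.argwhere(arr).ravel()   # same positions, flattened
```

Both results are integer positions, so with a non-default index they would need s.index[...] to be mapped back to labels.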