Getting a list of indices where pandas boolean series is True

I have a pandas series with boolean entries. I would like to get a list of indices where the values are True.

For example the input pd.Series([True, False, True, True, False, False, False, True])

should yield the output [0,2,3,7].

I can do it with a list comprehension, but is there something cleaner or faster?

James McKeown asked Sep 04 '18 19:09



2 Answers

Using Boolean Indexing

>>> s = pd.Series([True, False, True, True, False, False, False, True])
>>> s[s].index
Int64Index([0, 2, 3, 7], dtype='int64')

If you need a np.array object, get the .values

>>> s[s].index.values
array([0, 2, 3, 7])
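Since the question asks for a plain list, here is a minimal, self-contained sketch of the boolean-indexing approach that ends with `.tolist()`. The variable names `true_labels` and `result` are only illustrative.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Boolean indexing keeps only the True entries; .index holds their labels.
true_labels = s[s].index

# .tolist() converts the Index into a plain Python list.
result = true_labels.tolist()
print(result)
```

The same `.tolist()` call also works on the NumPy arrays returned by the other approaches.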

Using np.nonzero

>>> np.nonzero(s)
(array([0, 2, 3, 7]),)

Using np.flatnonzero

>>> np.flatnonzero(s)
array([0, 2, 3, 7])

Using np.where

>>> np.where(s)[0]
array([0, 2, 3, 7])

Using np.argwhere

>>> np.argwhere(s).ravel()
array([0, 2, 3, 7])
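One caveat worth illustrating: the NumPy functions return integer positions, while `s[s].index` returns index labels. With the default index these coincide, but they differ once the Series has a custom index. The sketch below uses a made-up three-element Series with string labels to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with a non-default (string) index.
s = pd.Series([True, False, True], index=["a", "b", "c"])

# NumPy works on the underlying values, so it reports positions.
positions = np.flatnonzero(s.to_numpy()).tolist()

# Boolean indexing on the Series reports the index labels instead.
labels = s[s].index.tolist()

print(positions)  # [0, 2]
print(labels)     # ['a', 'c']
```

If labels are what you need, `s.index[positions]` maps the positions back to labels.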

Using pd.Series.index

>>> s.index[s]
Int64Index([0, 2, 3, 7], dtype='int64')

Using python's built-in filter

>>> [*filter(s.get, s.index)]
[0, 2, 3, 7]

Using list comprehension

>>> [i for i in s.index if s[i]]
[0, 2, 3, 7]

rafaelc answered Sep 24 '22 20:09


As an addition to rafaelc's answer, here are the corresponding timings (from fastest to slowest) for the following setup:

import numpy as np
import pandas as pd
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
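The `timeit` lines that follow use the IPython magic. For anyone who wants to rerun a comparison outside IPython, here is a rough sketch using the standard-library `timeit` module on a similar 1000-element random Series. The `candidates` dictionary, the seed, and the repeat counts are my own choices, and absolute numbers will vary with hardware and with pandas/NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd

# Seeded generator so the Series is reproducible (assumption, not in the original setup).
rng = np.random.default_rng(0)
s = pd.Series(rng.random(1000) > 0.5)

# A few of the approaches from the answers, wrapped as zero-argument callables.
candidates = {
    "np.where": lambda: np.where(s)[0],
    "np.flatnonzero": lambda: np.flatnonzero(s),
    "s.index[s]": lambda: s.index[s],
    "s[s].index": lambda: s[s].index,
}

expected = np.flatnonzero(s.to_numpy())
timings = {}
for name, fn in candidates.items():
    # Check that every approach returns the same positions before timing it.
    assert np.array_equal(np.asarray(fn()), expected)
    # Best of 3 runs of 200 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(fn, number=200, repeat=3)) / 200

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {seconds * 1e6:8.1f} µs per call")
```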

Using np.where

>>> timeit np.where(s)[0]
12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using np.flatnonzero

>>> timeit np.flatnonzero(s)
18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using pd.Series.index

The gap to boolean indexing really surprised me, since boolean indexing is the more commonly used idiom.

>>> timeit s.index[s]
82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Using Boolean Indexing

>>> timeit s[s].index
1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you need a np.array object, get the .values

>>> timeit s[s].index.values
1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you need a slightly easier to read version <-- not in original answer

>>> timeit s[s==True].index
1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using pd.Series.where <-- not in original answer

>>> timeit s.where(s).dropna().index
2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.where(s == True).dropna().index
2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

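Using pd.Series.mask <-- not in original answer

>>> timeit s.mask(s).dropna().index
2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.mask(s == True).dropna().index
2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that mask replaces the entries where the condition is True, so s.mask(s) as timed here leaves the indices of the False entries after dropna; to get the True indices, the condition has to be inverted, i.e. s.mask(~s).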
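To make the behaviour of `where` and `mask` concrete, here is a small sketch on the Series from the question. `where` keeps entries where the condition holds and `mask` blanks them out, so the two need opposite conditions to select the same rows. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# where(cond) keeps values where cond is True and puts NaN elsewhere.
kept_by_where = s.where(s).dropna().index.tolist()

# mask(cond) puts NaN where cond is True, so mask(s) keeps the False entries.
kept_by_mask = s.mask(s).dropna().index.tolist()

# Inverting the condition makes mask select the True entries.
kept_by_inverted_mask = s.mask(~s).dropna().index.tolist()

print(kept_by_where)          # [0, 2, 3, 7]
print(kept_by_mask)           # [1, 4, 5, 6]
print(kept_by_inverted_mask)  # [0, 2, 3, 7]
```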

Using list comprehension

>>> timeit [i for i in s.index if s[i]]
13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using python's built-in filter

>>> timeit [*filter(s.get, s.index)]
14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using np.nonzero <-- did not work out of the box for me

>>> timeit np.nonzero(s)
ValueError: Length of passed values is 1, index implies 1000.

Using np.argwhere <-- did not work out of the box for me

>>> timeit np.argwhere(s).ravel()
ValueError: Length of passed values is 1, index implies 1000.
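A possible workaround, sketched under the assumption that the error comes from NumPy trying to wrap its result back into a Series with the original index: pass the underlying NumPy array instead of the Series itself. `Series.to_numpy()` is available from pandas 0.24 on; on older versions `s.values` plays the same role.

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Hand NumPy a plain ndarray so no pandas wrapping is involved.
values = s.to_numpy()

# np.nonzero returns a tuple with one array per dimension; take the first.
via_nonzero = np.nonzero(values)[0]

# np.argwhere returns shape (n, 1) for 1-D input; ravel flattens it.
via_argwhere = np.argwhere(values).ravel()

print(via_nonzero.tolist())
print(via_argwhere.tolist())
```

Whether the original calls fail at all depends on the installed pandas and NumPy versions, so this is only needed where the ValueError actually shows up.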

Christian Steinmeyer answered Sep 24 '22 20:09