Efficiently select rows that match one of several values in Pandas DataFrame

Tags: python, pandas

You can use the isin Series method:

In [11]: df['Name'].isin(['Alice', 'Bob'])
Out[11]: 
0     True
1     True
2    False
3     True
4    False
Name: Name, dtype: bool

In [12]: df[df.Name.isin(['Alice', 'Bob'])]
Out[12]: 
    Name  Amount
0  Alice     100
1    Bob      50
3  Alice      30
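
To make the session above reproducible, here is a minimal sketch. The question's DataFrame is not shown here, so the data below is an assumption: rows 0, 1 and 3 mirror the output above, while the names and amounts in rows 2 and 4 are made up.

```python
import pandas as pd

# Hypothetical data: only rows 0, 1 and 3 come from the output above;
# 'Carol', 'Dave' and their amounts are invented for illustration.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Alice", "Dave"],
    "Amount": [100, 50, 75, 30, 20],
})

# Boolean Series with one entry per row: True where Name is one of the wanted values.
mask = df["Name"].isin(["Alice", "Bob"])

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]
print(selected)
```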

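Since the values in df['Name'] are ints in your actual use case, you might be able to generate the boolean mask faster using NumPy indexing instead of Series.isin. This works when the values are non-negative integers smaller than some bound N, and names holds the values to match: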

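idx = np.zeros(N, dtype='bool')    # lookup table: one flag per possible value
idx[names] = True                  # flag the values we want to keep
df[idx[df['Name'].values]]         # look up each row's value to build the mask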

For example, given this setup:

import pandas as pd
import numpy as np

N = 100000
df = pd.DataFrame(np.random.randint(N, size=(10**6, 2)), columns=['Name', 'Amount'])
names = np.random.choice(np.arange(N), size=100, replace=False)

In [81]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
100 loops, best of 3: 9.88 ms per loop

In [82]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 107 ms per loop

In [83]: 107/9.88
Out[83]: 10.82995951417004

N is (essentially) an exclusive upper bound on the values that df['Name'] can attain: every value must be smaller than N so that it is a valid index into idx. If N is smaller, the speed benefit is not as large. With N = 200,

In [93]: %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
10 loops, best of 3: 62.6 ms per loop

In [94]: %timeit df[df.Name.isin(names)]
10 loops, best of 3: 178 ms per loop

In [95]: 178/62.6
Out[95]: 2.8434504792332267

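Caution: As shown above, there seems to be a speed benefit, particularly as N gets large. However, idx = np.zeros(N, dtype='bool') allocates one byte per possible value, so if N is too large, forming it may not be feasible. The values must also be non-negative integers, since they are used directly as array indices.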
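
One way to act on this caution is to wrap both approaches in a helper that falls back to Series.isin whenever the lookup table is not applicable. This is only a sketch under stated assumptions: the function name select_by_lookup and the max_table_size threshold are hypothetical, not part of pandas or NumPy, and the default threshold is an arbitrary choice you would tune to your memory budget.

```python
import numpy as np
import pandas as pd

def select_by_lookup(df, column, wanted, max_table_size=10**8):
    """Return the rows of df whose `column` value is in `wanted`.

    Hypothetical helper: uses the boolean lookup table when the values are
    non-negative integers and the table stays small, else uses Series.isin.
    """
    col = df[column].to_numpy()
    wanted = np.asarray(wanted)

    # The lookup table only works for non-empty, non-negative integer data.
    usable = (
        np.issubdtype(col.dtype, np.integer)
        and np.issubdtype(wanted.dtype, np.integer)
        and col.size > 0
        and wanted.size > 0
        and col.min() >= 0
        and wanted.min() >= 0
    )
    if usable:
        # The table must be indexable by every value in both arrays.
        size = int(max(col.max(), wanted.max())) + 1
        if size <= max_table_size:
            table = np.zeros(size, dtype=bool)
            table[wanted] = True
            return df[table[col]]

    # Fallback: general-purpose membership test.
    return df[df[column].isin(wanted)]

# Made-up demo data, not from the original question.
demo = pd.DataFrame({"Name": [3, 7, 3, 9, 5], "Amount": [10, 20, 30, 40, 50]})
picked = select_by_lookup(demo, "Name", [3, 5])
# A tiny threshold forces the isin fallback; the result should be identical.
fallback = select_by_lookup(demo, "Name", [3, 5], max_table_size=4)
```

Sizing the table from the observed maximum rather than a fixed N avoids out-of-bounds indexing, at the cost of an extra pass over the column.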


Sanity check:

expected = df[df.Name.isin(names)]
idx = np.zeros(N, dtype='bool')
idx[names] = True
result = df[idx[df['Name'].values]]
assert expected.equals(result)