
What is the fastest way to select rows that contain a value in a Pandas dataframe?

I am currently following the instructions laid out in the question linked below for finding values, and it works. The only problem is that my dataframe is fairly big (5x3500 rows) and I need to perform around 2000 searches. Each one takes about 4 seconds, so this adds up and has become unsustainable on my end.

Most concise way to select rows where any column contains a string in Pandas dataframe?

Is there a faster way to search for all rows containing a string value than this?

df[df.apply(lambda r: r.str.contains('b', case=False).any(), axis=1)] 
NBC asked Feb 02 '19 01:02



2 Answers

You can test the speed of this NumPy-based approach:

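import numpy as np

# Convert every cell to str, search for the substring, and flag rows with any match
boolfilter = (np.char.find(df.values.ravel().astype(str), 'b') != -1).reshape(df.shape).any(axis=1)
boolfilter
array([False,  True,  True])
newdf = df[boolfilter]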
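
As a minimal, self-contained sketch of the same idea: the sample frame below is made up for illustration, and the `np.char.lower` step is an addition here to mimic `case=False` from the question (the one-liner above is case-sensitive).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({
    "x": ["apple", "Banana", "cherry"],
    "y": ["kiwi", "plum", "crab"],
})

# Flatten all cells into a 1-D array of str and lower-case them so the
# search is case-insensitive, then look for the substring in each cell
flat = np.char.lower(df.to_numpy().ravel().astype(str))
hits = np.char.find(flat, "b") != -1

# Restore the frame's shape; a row is kept if any of its cells matched
boolfilter = hits.reshape(df.shape).any(axis=1)
newdf = df[boolfilter]
print(newdf)
```

Note that `astype(str)` also turns missing values into the text `'nan'`, so a search for `'n'` or `'a'` would match them; fill or drop missing values first if that matters.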
BENY answered Oct 21 '22 17:10


One trivial possibility is to disable regex:

res = df[df.apply(lambda r: r.str.contains('b', case=False, regex=False).any(), axis=1)] 

Another way using a list comprehension:

res = df[[any('b' in x.lower() for x in row) for row in df.values]]
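
To compare the variants on your own data, something like the following could be used. It is only a sketch: the frame is randomly generated (sized like the 5x3500 one in the question), the function names are made up here, and the NumPy variant adds lower-casing so that all four are case-insensitive. Timings depend on the machine and the pandas version, so none are quoted.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical test data: 3500 rows x 5 columns of random 3-letter strings
rng = np.random.default_rng(0)
letters = np.array(list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"))
words = ["".join(chars) for chars in rng.choice(letters, size=(3500 * 5, 3))]
df = pd.DataFrame(np.array(words, dtype=object).reshape(3500, 5), columns=list("vwxyz"))


def apply_regex(df, s):
    # Row-wise apply with the default regex matching, as in the question
    return df[df.apply(lambda r: r.str.contains(s, case=False).any(), axis=1)]


def apply_no_regex(df, s):
    # Same, but with plain substring matching instead of regex
    return df[df.apply(lambda r: r.str.contains(s, case=False, regex=False).any(), axis=1)]


def list_comp(df, s):
    # Pure-Python scan over the underlying array; assumes every cell is a str
    s = s.lower()
    return df[[any(s in x.lower() for x in row) for row in df.values]]


def numpy_find(df, s):
    # Vectorised substring search over all cells at once
    flat = np.char.lower(df.values.ravel().astype(str))
    return df[(np.char.find(flat, s.lower()) != -1).reshape(df.shape).any(axis=1)]


results = {}
for fn in (apply_regex, apply_no_regex, list_comp, numpy_find):
    start = time.perf_counter()
    results[fn.__name__] = fn(df, "b")
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed:.4f} s")

# Sanity check: every variant should select exactly the same rows
assert all(r.equals(results["apply_regex"]) for r in results.values())
```

A single run per function is a rough measure; wrapping each call in `timeit.repeat` would give steadier numbers.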
jpp answered Oct 21 '22 15:10