Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make pandas dataframe str.contains search faster

I'm searching for a substring or multiple substrings in the dataframe of 4 million rows.

df[df.col.str.contains('Donald',case=True,na=False)]

or

df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)]

DataFrame(df) looks like below (with 4 million string rows)

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})

Is there any tip to make this string search faster? For example, sorting dataframe first, certain way of indexing, changing column names to numbers, dropping "na=False" from the query, etc.? Even milliseconds of speed increase will be very helpful!

like image 457
aerin Avatar asked Jun 18 '16 06:06

aerin


People also ask

Is pandas query faster than LOC?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

Is pandas indexing fast?

Like a Python dictionary (or a relational database's index), Pandas indexing provides a fast way to turn a key into a value.

Are pandas faster than dictionary?

For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need/want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.

How to search for a string in a pandas column?

In this tutorial, we will look at how to search for a string (or a substring) in a pandas dataframe column with the help of some examples. How to check if a pandas series contains a string? You can use the pandas.series.str.contains () function to search for the presence of a string in a pandas series (or column of a dataframe).

What are pandas options to solve Dataframe problems?

Pandas options (1) There’re too many columns / rows in the dataframe and some columns / rows in the middle are omitted. (2) Columns containing long texts get truncated. (3) Columns containing floats display too many / too few digits. to solve these issues. 3. Group by with multiple aggregations

How to access the values of a pandas series as strings?

Python | Pandas Series.str.contains () Series.str can be used to access the values of the series as strings and apply several methods to it. Pandas Series.str.contains() function is used to test if pattern or regex is contained within a string of a Series or Index. The function return boolean Series or Index based on whether a given pattern...

How to test if a string contains a pattern in pandas?

Pandas Series.str.contains () function is used to test if pattern or regex is contained within a string of a Series or Index. The function return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index. Syntax: Series.str.contains (pat, case=True, flags=0, na=nan, regex=True) Parameter :


2 Answers

If the number of substrings is small, it may be faster to search for one at a time, because you can pass the regex=False argument to contains, which speeds it up.

On a sample DataFrame of about 6000 rows that I tested it with on two sample substrings, blah.contains("foo", regex=False)| blah.contains("bar", regex=False) was about twice as fast as blah.contains("foo|bar"). You'd have to test it with your data to see how it scales.

like image 74
BrenBarn Avatar answered Oct 13 '22 10:10

BrenBarn


You could converting it to a list. It seems that searching in a list rather than applying string methods to a series is significantly faster.

Sample code:

import timeit
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})



def first_way():
    df["new"] = pd.Series(df["col"].str.contains('Donald',case=True,na=False))
    return None
print "First_way: "
%timeit for x in range(10): first_way()
print df

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})


def second_way():
    listed = df["col"].tolist()
    df["new"] = ["Donald" in n for n in listed]
    return None

print "Second way: "
%timeit for x in range(10): second_way()
print df

Results:

First_way: 
100 loops, best of 3: 2.77 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False
Second way: 
1000 loops, best of 3: 1.79 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False
like image 36
Yarnspinner Avatar answered Oct 13 '22 10:10

Yarnspinner