Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas filtering for multiple substrings in series

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unusual / regex characters. The comparison should not involve regex and is case insensitive.

For example:

lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*'] 

I currently apply the mask like this:

mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst]) df = df[mask] 

My dataframe is large (~1mio rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.

like image 670
jpp Avatar asked Jan 31 '18 11:01

jpp


People also ask

How to filter pandas by substring?

Using Series. Series. str. contains() method in pandas allows you to search a column for a specific substring. The contains() method returns boolean values for the series with True when the original Series value contains the substring and False if not.

Can a Pandas series have multiple data types?

In the same way you can't attach a specific data type to list , even if all elements are of the same type, a Pandas object series contains pointers to any number of types.

How do I find a substring in a column in Python?

Using “contains” to Find a Substring in a Pandas DataFrame The contains method in Pandas allows you to search a column for a specific substring. The contains method returns boolean values for the Series with True for if the original Series value contains the substring and False if not.


2 Answers

If you're sticking to using pure-pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex meta characters).

This is easy to do using re.escape:

>>> import re >>> esc_lst = [re.escape(s) for s in lst] 

These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).

>>> pattern = '|'.join(esc_lst) 

The masking stage then becomes a single low-level loop through the rows:

df[col].str.contains(pattern, case=False) 

Here's a simple setup to get a sense of performance:

from random import randint, seed  seed(321)  # 100 substrings of 5 characters lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]  # 50000 strings of 20 characters strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]  col = pd.Series(strings) esc_lst = [re.escape(s) for s in lst] pattern = '|'.join(esc_lst) 

The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):

%timeit col.str.contains(pattern, case=False) 1 loop, best of 3: 981 ms per loop 

The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches than the timing will improve.

like image 78
Alex Riley Avatar answered Oct 09 '22 11:10

Alex Riley


You could try using the Aho-Corasick algorithm. In the average case, it is O(n+m+p) where n is length of the search strings and m is the length of the searched text and p is the number of output matches.

The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).

pyahocorasick is a Python wrapper around a C implementation of the algorithm.


Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_aho_corasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:

|                    |     speed factor | ms per loop | |                    | compared to orig |             | |--------------------+------------------+-------------| | using_aho_corasick |            30.7x |         140 | | using_regex        |             2.7x |        1580 | | orig               |             1.0x |        4300 | 

In [89]: %timeit using_ahocorasick(col, lst) 10 loops, best of 3: 140 ms per loop  In [88]: %timeit using_regex(col, lst) 1 loop, best of 3: 1.58 s per loop  In [91]: %timeit orig(col, lst) 1 loop, best of 3: 4.3 s per loop 

Here the setup used for the benchmark. It also verifies that the output matches the result returned by orig:

import numpy as np import random import pandas as pd import ahocorasick import re  random.seed(321)  def orig(col, lst):     mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False)                                   for i in lst])     return mask  def using_regex(col, lst):     """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""     esc_lst = [re.escape(s) for s in lst]     pattern = '|'.join(esc_lst)     mask = col.str.contains(pattern, case=False)     return mask  def using_ahocorasick(col, lst):     A = ahocorasick.Automaton(ahocorasick.STORE_INTS)     for word in lst:         A.add_word(word.lower())     A.make_automaton()      col = col.str.lower()     mask = col.apply(lambda x: bool(list(A.iter(x))))     return mask  N = 50000 # 100 substrings of 5 characters lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]  # N strings of 20 characters strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)] # make about 10% of the strings match a string from lst; this helps check that our method works strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]  col = pd.Series(strings)  expected = orig(col, lst) for name, result in [('using_regex', using_regex(col, lst)),                      ('using_ahocorasick', using_ahocorasick(col, lst))]:     status = 'pass' if np.allclose(expected, result) else 'fail'     print('{}: {}'.format(name, status)) 
like image 26
unutbu Avatar answered Oct 09 '22 13:10

unutbu