I need to filter rows in a pandas DataFrame so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual / regex characters. The comparison should not use regex and should be case-insensitive.
For example:
lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
I currently apply the mask like this:
mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]
My DataFrame is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.
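For illustration, a naive per-row version that does short-circuit on the first hit (plain Python any() over lowercased strings, no regex) would look like the sketch below, but it gives up pandas' vectorised string methods, so I don't know whether it would actually be faster:

lst_lower = [s.lower() for s in lst]

# any() stops testing substrings as soon as one is found in the row's text
mask = df[col].str.lower().apply(lambda text: any(sub in text for sub in lst_lower))
df = df[mask]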
If you're sticking to pure pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
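Putting it together, the full filtering step for the DataFrame in the question would look roughly like this (a sketch assuming df, col and lst are defined as above):

import re

# escape each substring so that regex metacharacters are matched literally
pattern = '|'.join(re.escape(s) for s in lst)

# one vectorised, case-insensitive pass over the column
mask = df[col].str.contains(pattern, case=False)
df = df[mask]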
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)

esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches then the timing will improve.
You could try using the Aho-Corasick algorithm. In the average case, it is O(n + m + p), where n is the total length of the search strings, m is the length of the searched text, and p is the number of output matches.
The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).
pyahocorasick is a Python wrapper around a C implementation of the algorithm.
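As a quick illustration of the API (the full benchmark is further down), building an automaton from some substrings and testing one string for any match might look like this; the words and text here are made-up placeholders:

import ahocorasick

A = ahocorasick.Automaton()
for idx, word in enumerate(['abc+d?', 'xy|z']):   # placeholder substrings
    A.add_word(word.lower(), idx)                  # store the list index as the payload
A.make_automaton()                                 # build the failure links

text = 'Contains ABC+d? somewhere'.lower()
# A.iter() yields (end_index, payload) for each match; any() stops at the first one
print(any(True for _ in A.iter(text)))             # True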
Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50,000-row test case:
|                    | speed factor     | ms per loop |
|                    | compared to orig |             |
|--------------------+------------------+-------------|
| using_ahocorasick  | 30.7x            |         140 |
| using_regex        |  2.7x            |        1580 |
| orig               |  1.0x            |        4300 |
In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop

In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop

In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop
Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:
import numpy as np
import random
import pandas as pd
import ahocorasick
import re

random.seed(321)

def orig(col, lst):
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False) for i in lst])
    return mask

def using_regex(col, lst):
    """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    col = col.str.lower()
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000

# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]

# append a random substring from lst to most of the strings so that some rows match;
# this helps check that the methods agree
strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]

col = pd.Series(strings)

expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))
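One possible tweak, not part of the benchmark above: since A.iter() returns a lazy iterator, the per-row check can stop at the first match instead of materialising every match with list(), which is exactly the short-circuiting behaviour asked about in the question. A sketch, assuming the same lowercasing trick for case-insensitivity:

import ahocorasick

def using_ahocorasick_first_match(col, lst):
    # hypothetical variant of using_ahocorasick: any() stops consuming
    # A.iter() as soon as the first match for a row is found
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    return col.str.lower().apply(lambda x: any(True for _ in A.iter(x)))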