I need to filter rows in a pandas DataFrame so that a specific string column contains at least one of a list of provided substrings. The substrings may contain unusual / regex characters. The comparison should not use regex and should be case-insensitive.
For example:
lst = ['kdSj;af-!?', 'aBC+dsfa?\-', 'sdKaJg|dksaf-*']
I currently apply the mask like this:
mask = np.logical_or.reduce([df[col].str.contains(i, regex=False, case=False) for i in lst])
df = df[mask]
My DataFrame is large (~1 million rows) and lst has length 100. Is there a more efficient way? For example, if the first item in lst is found, we should not have to test any subsequent strings for that row.
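For illustration, a naive per-row version that does short-circuit on the first hit (plain Python any() over lowercased strings, no regex) would look like the sketch below, but it gives up pandas' vectorised string methods, so I don't know whether it would actually be faster:

lst_lower = [s.lower() for s in lst]

# any() stops testing substrings as soon as one is found in the row's text
mask = df[col].str.lower().apply(lambda text: any(sub in text for sub in lst_lower))
df = df[mask]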
If you're sticking to pure pandas, for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first to ensure that they are matched literally (and not used as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe |. Each of the substrings can be checked against a string until one matches (or they have all been tested).
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
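Putting it together, the full filtering step for the DataFrame in the question would look roughly like this (a sketch assuming df, col and lst are defined as above):

import re

# escape each substring so that regex metacharacters are matched literally
pattern = '|'.join(re.escape(s) for s in lst)

# one vectorised, case-insensitive pass over the column
mask = df[col].str.contains(pattern, case=False)
df = df[mask]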
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)

esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches then the timing will improve.
You could try using the Aho-Corasick algorithm. In the average case, it is O(n + m + p), where n is the total length of the search strings, m is the length of the searched text, and p is the number of output matches.
The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).
pyahocorasick is a Python wrapper around a C implementation of the algorithm.
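As a quick illustration of the API (the full benchmark is further down), building an automaton from some substrings and testing one string for any match might look like this; the words and text here are made-up placeholders:

import ahocorasick

A = ahocorasick.Automaton()
for idx, word in enumerate(['abc+d?', 'xy|z']):   # placeholder substrings
    A.add_word(word.lower(), idx)                  # store the list index as the payload
A.make_automaton()                                 # build the failure links

text = 'Contains ABC+d? somewhere'.lower()
# A.iter() yields (end_index, payload) for each match; any() stops at the first one
print(any(True for _ in A.iter(text)))             # True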
Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50,000-row test case:
|                    | speed factor     | ms per loop |
|                    | compared to orig |             |
|--------------------+------------------+-------------|
| using_ahocorasick  | 30.7x            |         140 |
| using_regex        |  2.7x            |        1580 |
| orig               |  1.0x            |        4300 |
In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop

In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop

In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop
Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:
import numpy as np
import random
import pandas as pd
import ahocorasick
import re

random.seed(321)

def orig(col, lst):
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False) for i in lst])
    return mask

def using_regex(col, lst):
    """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    col = col.str.lower()
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000

# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 256)) for _ in range(20)]) for _ in range(N)]

# append a random substring from lst to most of the strings so that some rows match;
# this helps check that the methods agree
strings = [_ if random.randint(0, 99) < 10 else _+random.choice(lst) for _ in strings]

col = pd.Series(strings)

expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))
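One possible tweak, not part of the benchmark above: since A.iter() returns a lazy iterator, the per-row check can stop at the first match instead of materialising every match with list(), which is exactly the short-circuiting behaviour asked about in the question. A sketch, assuming the same lowercasing trick for case-insensitivity:

import ahocorasick

def using_ahocorasick_first_match(col, lst):
    # hypothetical variant of using_ahocorasick: any() stops consuming
    # A.iter() as soon as the first match for a row is found
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    return col.str.lower().apply(lambda x: any(True for _ in A.iter(x)))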