High performance mass short string search in Python

Tags: python, string, search

The problem: a large static list of strings is provided as A, and a long string is provided as B. The strings in A are all very short (a keyword list). I want to find which strings in A are substrings of B and collect them.

Now I use a simple loop like:

result = []
for word in A:
    if word in B:
        result.append(word)

But it's crazy slow when A contains ~500,000 or more items.

Is there any library or algorithm that fits this problem? I've tried my best to search but no luck.

Thank you!


asked Jan 13 '12 02:01

Felix Yan

1 Answer

Your problem is large enough that you probably need to hit it with the algorithm bat.

Take a look into the Aho-Corasick Algorithm. Your problem statement is a paraphrase of the problem that this algorithm tackles.
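To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick, assuming A is a list of str and B is a str. The function names (build_automaton, find_keywords) are made up for illustration and do not come from any library. It builds a trie of the keywords, adds failure links with a breadth-first pass, and then scans B once, so the scan cost depends on the length of B rather than on the number of keywords.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and output lists (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is in the trie
    out = [[]]    # out[state] -> keywords that end at this state
    for word in words:
        if not word:
            continue  # assumption: empty keywords are skipped
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # derive their failure link from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target (shorter suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(words, text):
    """Return the items of `words` that occur in `text`, in their original order."""
    goto, fail, out = build_automaton(words)
    found = set()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return [w for w in words if w in found]
```

Calling result = find_keywords(A, B) would stand in for the loop in the question. A pure-Python automaton over ~500,000 keywords can use a lot of memory, so a C-backed implementation may be the better choice in practice; the sketch is only meant to show the mechanics.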

Also, look into the work by Nicholas Lehuen with his PyTST package.

There are also references in a related Stack Overflow question, "Algorithm for linear pattern matching?", that mention other algorithms such as Rabin-Karp.
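For comparison, a multi-pattern Rabin-Karp search could be sketched roughly as follows. This is an illustration under assumptions, not code from the linked question: the names poly_hash and rabin_karp_multi, the base 257, and the modulus 2**61 - 1 are all arbitrary choices made here. Keywords are grouped by length, and for each length a rolling hash slides over B; a hash hit is confirmed by comparing the actual window against the keywords with that hash.

```python
from collections import defaultdict


def poly_hash(s, base, mod):
    """Polynomial hash of s: sum of ord(s[j]) * base**(len(s)-1-j), modulo mod."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def rabin_karp_multi(words, text, base=257, mod=(1 << 61) - 1):
    """Return the set of keywords from `words` that occur in `text` (hypothetical helper)."""
    # Group keywords by length, then by hash value.
    by_len = defaultdict(lambda: defaultdict(set))
    for w in words:
        if w:  # assumption: empty keywords are skipped
            by_len[len(w)][poly_hash(w, base, mod)].add(w)

    found = set()
    n = len(text)
    for m, table in by_len.items():
        if m > n:
            continue
        high = pow(base, m - 1, mod)  # weight of the window's leading character
        h = poly_hash(text[:m], base, mod)
        for i in range(n - m + 1):
            candidates = table.get(h)
            if candidates:
                window = text[i:i + m]
                if window in candidates:  # confirm, since hashes can collide
                    found.add(window)
            if i + m < n:
                # Roll the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return found
```

This makes one pass over B per distinct keyword length, which stays manageable when the keywords are all short, as in the question.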


answered Sep 25 '22 18:09

dyoo
