Fastest way to check whether a string is a substring in a list of strings

Tags:

python-3.x

I have a static list of 4000 distinct first names: so the length of the list is big (4000), but each string has about from 4 to 12 characters (they are names).

Then, I have a dynamic list of 10000 strings retrieved from a database: these strings may have an arbitrary length.

I need to output, for each one of the 10000 strings, whether that string contains one of the 4000 names, and if so, which one. If it contains more than one name, I need just one of them (i.e. the first). Moreover, it is unlikely to find such names, so maybe just 10 of the 10000 will contain a name.

My code so far:

names # list of 4000 short static names
fields # list of 10000 retrieved strings

def findit(element):
    for name in names:
        if name in element:
            return name
    return None

output = [findit(element) for element in fields]

This works of course. However, it is totally slow, since it is unlikely to find a name, and since I'm testing for being a substring and not for equality (i.e. I cannot use bisect or other sorted based index techniques). It scans all the names list completely, almost all the time. So basically, it executes about 10000 x 4000 = 40 million "in" comparisons.

Is there an algorithm to optimize this kind of search?

474

asked Sep 06 '17 09:09

edoedoedo

1 Answers

You could look into turning your list of names into one regular expression. Take for example this tiny list of names:

names = ['AARON',
    'ABDUL',
    'ABE',
    'ABEL',
    'ABRAHAM',
    'ABRAM',
    'ADALBERTO',
    'ADAM',
    'ADAN',
    'ADOLFO',
    'ADOLPH',
    'ADRIAN',
]

That could be represented with the following regular expression:

\b(?:AARON|ABDUL|ABE|ABEL|ABRAHAM|ABRAM|ADALBERTO|ADAM|ADAN|ADOLFO|ADOLPH|ADRIAN)\b

But that would not be very efficient. A regular expression that is built like a tree will work better:

\b(?:A(?:B(?:E(?:|L)|RA(?:M|HAM)|DUL)|D(?:A(?:M|N|LBERTO)|OL(?:FO|PH)|RIAN)|ARON))\b

You could then automate the production of this regular expression -- possibly by first creating a dict-tree structure from the list of names, and then translating that tree into a regular expression. For the above example, that intermediate tree would look like this:

... which could optionally be simplified to this:

{
    'A': {
        'ARON': {
            '': {}
        }
        'B': {
            'DUL': {
                '': {}
            },
            'E': {
                '': {}, 
                'L': {
                    '': {}
                }
            },
            'RA': {
                'HAM': {
                    '': {}
                },
                'M': {
                    '': {}
                } 
            } 
        }, 

... etc

Here is the suggested code to do this:

import re 

def addToTree(tree, name):
    if len(name) == 0:
        return
    if name[0] in tree.keys():
        addToTree(tree[name[0]], name[1:])
    else:
        for letter in name:
            tree[letter] = {}
            tree = tree[letter]
        tree[''] = {}

# Optional improvement of the tree: it combines several consecutive letters into 
# one key if there are no alternatives possible
def simplifyTree(tree):
    repeat = True
    while repeat:
        repeat = False
        for key, subtree in list(tree.items()):
            if key != '' and len(subtree) == 1 and '' not in subtree.keys():
                for letter, subsubtree in subtree.items():
                    tree[key + letter] = subsubtree
                del tree[key]
                repeat = True
    for key, subtree in tree.items():
        if key != '': 
            simplifyTree(subtree)

def treeToRegExp(tree):
    regexp = [re.escape(key) + treeToRegExp(subtree) for key, subtree in tree.items()]
    regexp = '|'.join(regexp)
    return '' if regexp == '' else '(?:' + regexp + ')'

def listToRegExp(names):
    tree = {}
    for name in names:
        addToTree(tree, name[:])
    simplifyTree(tree)
    return re.compile(r'\b' + treeToRegExp(tree) + r'\b', re.I)

# Demo
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]

fields = [
    'This is Aaron speaking',
    'Is Abex a name?',
    'Where did Abraham get the mustard from?'
]

regexp = listToRegExp(names)
# get the search result for each field, and link it with the index of the field
results = [[i, regexp.search(field)] for i, field in enumerate(fields)]
# remove non-matches from the results
results = [[i, match.group(0)] for [i, match] in results if match]
# print results
print(results)

See it run on repl.it

115

answered Oct 11 '22 21:10

trincot

Related questions
                            
                                ModuleNotFoundError on console_scripts execution
                            
                                How to flip a byte in python?
                            
                                Python 3 Update MySQL database table
                            
                                No module named 'info' on fresh Python 3 installation
                            
                                pyldavis Unable to view the graph
                            
                                Semantics of `async for` - can __anext__ calls overlap?
                            
                                Cannot Upgrade from python 3.5.2 to 3.6
                            
                                Appending to a text file in S3
                            
                                How does 'global' behave under an if statement?
                            
                                Deploying Django Channels to Elastic Beanstalk Python3.4 environment
                            
                                Scrapy shell return without response
                            
                                Python 3 No module named '_ssl'
                            
                                asyncio matplotlib show() still freezes program
                            
                                pandas row values to column headers
                            
                                HTTP Error 403: Forbidden while downloading file using urllib
                            
                                Python 3.6 tkinter window icon on Linux error
                            
                                Keras log_loss error is same
                            
                                Circular imports in classes
                            
                                Import LLDB in Python 3 on Mac?
                            
                                Django Autocomplete Light List Object has no Attribute Queryset

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With