Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to check whether a string is a substring in a list of strings

Tags:

python-3.x

I have a static list of 4000 distinct first names: so the length of the list is big (4000), but each string has about from 4 to 12 characters (they are names).

Then, I have a dynamic list of 10000 strings retrieved from a database: these strings may have an arbitrary length.

I need to output, for each one of the 10000 strings, whether that string contains one of the 4000 names, and if so, which one. If it contains more than one name, I need just one of them (i.e. the first). Moreover, it is unlikely to find such names, so maybe just 10 of the 10000 will contain a name.

My code so far:

names # list of 4000 short static names
fields # list of 10000 retrieved strings

def findit(element):
    for name in names:
        if name in element:
            return name
    return None

output = [findit(element) for element in fields]

This works of course. However, it is totally slow, since it is unlikely to find a name, and since I'm testing for being a substring and not for equality (i.e. I cannot use bisect or other sorted based index techniques). It scans all the names list completely, almost all the time. So basically, it executes about 10000 x 4000 = 40 million "in" comparisons.

Is there an algorithm to optimize this kind of search?

like image 474
edoedoedo Avatar asked Sep 06 '17 09:09

edoedoedo


People also ask

How do you check if a substring is in a list of strings?

Use any() to check if a list contains a substring. Call any(iterable) with iterable as a for-loop that checks if any element in the list contains the substring. Alternatively, use a list comprehension to construct a new list containing each element that contains the substring.

How do you check if a string has a substring from a list in Python?

The easiest way to check if a Python string contains a substring is to use the in operator. The in operator is used to check data structures for membership in Python. It returns a Boolean (either True or False ).

How do you check if a string contains a substring from a list Java?

You can use contains(), indexOf() and lastIndexOf() method to check if one String contains another String in Java or not. If a String contains another String then it's known as a substring. The indexOf() method accepts a String and returns the starting position of the string if it exists, otherwise, it will return -1.


1 Answers

You could look into turning your list of names into one regular expression. Take for example this tiny list of names:

names = ['AARON',
    'ABDUL',
    'ABE',
    'ABEL',
    'ABRAHAM',
    'ABRAM',
    'ADALBERTO',
    'ADAM',
    'ADAN',
    'ADOLFO',
    'ADOLPH',
    'ADRIAN',
]

That could be represented with the following regular expression:

\b(?:AARON|ABDUL|ABE|ABEL|ABRAHAM|ABRAM|ADALBERTO|ADAM|ADAN|ADOLFO|ADOLPH|ADRIAN)\b

But that would not be very efficient. A regular expression that is built like a tree will work better:

\b(?:A(?:B(?:E(?:|L)|RA(?:M|HAM)|DUL)|D(?:A(?:M|N|LBERTO)|OL(?:FO|PH)|RIAN)|ARON))\b

You could then automate the production of this regular expression -- possibly by first creating a dict-tree structure from the list of names, and then translating that tree into a regular expression. For the above example, that intermediate tree would look like this:

{
    'A': {
        'A': {
            'R': {
                'O': {
                    'N': {
                        '': {}
                    }
                }
            }
        }, 
        'B': {
            'D': {
                'U': {
                    'L': {
                        '': {}
                    }
                }
            }, 
            'E': {
                '': {}, 
                'L': {
                    '': {}
                }
            }, 
... etc

... which could optionally be simplified to this:

{
    'A': {
        'ARON': {
            '': {}
        }
        'B': {
            'DUL': {
                '': {}
            },
            'E': {
                '': {}, 
                'L': {
                    '': {}
                }
            },
            'RA': {
                'HAM': {
                    '': {}
                },
                'M': {
                    '': {}
                } 
            } 
        }, 

... etc

Here is the suggested code to do this:

import re 

def addToTree(tree, name):
    if len(name) == 0:
        return
    if name[0] in tree.keys():
        addToTree(tree[name[0]], name[1:])
    else:
        for letter in name:
            tree[letter] = {}
            tree = tree[letter]
        tree[''] = {}

# Optional improvement of the tree: it combines several consecutive letters into 
# one key if there are no alternatives possible
def simplifyTree(tree):
    repeat = True
    while repeat:
        repeat = False
        for key, subtree in list(tree.items()):
            if key != '' and len(subtree) == 1 and '' not in subtree.keys():
                for letter, subsubtree in subtree.items():
                    tree[key + letter] = subsubtree
                del tree[key]
                repeat = True
    for key, subtree in tree.items():
        if key != '': 
            simplifyTree(subtree)

def treeToRegExp(tree):
    regexp = [re.escape(key) + treeToRegExp(subtree) for key, subtree in tree.items()]
    regexp = '|'.join(regexp)
    return '' if regexp == '' else '(?:' + regexp + ')'

def listToRegExp(names):
    tree = {}
    for name in names:
        addToTree(tree, name[:])
    simplifyTree(tree)
    return re.compile(r'\b' + treeToRegExp(tree) + r'\b', re.I)

# Demo
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]

fields = [
    'This is Aaron speaking',
    'Is Abex a name?',
    'Where did Abraham get the mustard from?'
]

regexp = listToRegExp(names)
# get the search result for each field, and link it with the index of the field
results = [[i, regexp.search(field)] for i, field in enumerate(fields)]
# remove non-matches from the results
results = [[i, match.group(0)] for [i, match] in results if match]
# print results
print(results)

See it run on repl.it

like image 115
trincot Avatar answered Oct 11 '22 21:10

trincot