I have a static list of 4000 distinct first names: so the length of the list is big (4000), but each string has about from 4 to 12 characters (they are names).
Then, I have a dynamic list of 10000 strings retrieved from a database: these strings may have an arbitrary length.
I need to output, for each one of the 10000 strings, whether that string contains one of the 4000 names, and if so, which one. If it contains more than one name, I need just one of them (i.e. the first). Moreover, it is unlikely to find such names, so maybe just 10 of the 10000 will contain a name.
My code so far:
names # list of 4000 short static names
fields # list of 10000 retrieved strings
def findit(element):
for name in names:
if name in element:
return name
return None
output = [findit(element) for element in fields]
This works of course. However, it is totally slow, since it is unlikely to find a name, and since I'm testing for being a substring and not for equality (i.e. I cannot use bisect or other sorted based index techniques). It scans all the names list completely, almost all the time. So basically, it executes about 10000 x 4000 = 40 million "in" comparisons.
Is there an algorithm to optimize this kind of search?
Use any() to check if a list contains a substring. Call any(iterable) with iterable as a for-loop that checks if any element in the list contains the substring. Alternatively, use a list comprehension to construct a new list containing each element that contains the substring.
The easiest way to check if a Python string contains a substring is to use the in operator. The in operator is used to check data structures for membership in Python. It returns a Boolean (either True or False ).
You can use contains(), indexOf() and lastIndexOf() method to check if one String contains another String in Java or not. If a String contains another String then it's known as a substring. The indexOf() method accepts a String and returns the starting position of the string if it exists, otherwise, it will return -1.
You could look into turning your list of names into one regular expression. Take for example this tiny list of names:
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]
That could be represented with the following regular expression:
\b(?:AARON|ABDUL|ABE|ABEL|ABRAHAM|ABRAM|ADALBERTO|ADAM|ADAN|ADOLFO|ADOLPH|ADRIAN)\b
But that would not be very efficient. A regular expression that is built like a tree will work better:
\b(?:A(?:B(?:E(?:|L)|RA(?:M|HAM)|DUL)|D(?:A(?:M|N|LBERTO)|OL(?:FO|PH)|RIAN)|ARON))\b
You could then automate the production of this regular expression -- possibly by first creating a dict
-tree structure from the list of names, and then translating that tree into a regular expression. For the above example, that intermediate tree would look like this:
{
'A': {
'A': {
'R': {
'O': {
'N': {
'': {}
}
}
}
},
'B': {
'D': {
'U': {
'L': {
'': {}
}
}
},
'E': {
'': {},
'L': {
'': {}
}
},
... etc
... which could optionally be simplified to this:
{
'A': {
'ARON': {
'': {}
}
'B': {
'DUL': {
'': {}
},
'E': {
'': {},
'L': {
'': {}
}
},
'RA': {
'HAM': {
'': {}
},
'M': {
'': {}
}
}
},
... etc
Here is the suggested code to do this:
import re
def addToTree(tree, name):
if len(name) == 0:
return
if name[0] in tree.keys():
addToTree(tree[name[0]], name[1:])
else:
for letter in name:
tree[letter] = {}
tree = tree[letter]
tree[''] = {}
# Optional improvement of the tree: it combines several consecutive letters into
# one key if there are no alternatives possible
def simplifyTree(tree):
repeat = True
while repeat:
repeat = False
for key, subtree in list(tree.items()):
if key != '' and len(subtree) == 1 and '' not in subtree.keys():
for letter, subsubtree in subtree.items():
tree[key + letter] = subsubtree
del tree[key]
repeat = True
for key, subtree in tree.items():
if key != '':
simplifyTree(subtree)
def treeToRegExp(tree):
regexp = [re.escape(key) + treeToRegExp(subtree) for key, subtree in tree.items()]
regexp = '|'.join(regexp)
return '' if regexp == '' else '(?:' + regexp + ')'
def listToRegExp(names):
tree = {}
for name in names:
addToTree(tree, name[:])
simplifyTree(tree)
return re.compile(r'\b' + treeToRegExp(tree) + r'\b', re.I)
# Demo
names = ['AARON',
'ABDUL',
'ABE',
'ABEL',
'ABRAHAM',
'ABRAM',
'ADALBERTO',
'ADAM',
'ADAN',
'ADOLFO',
'ADOLPH',
'ADRIAN',
]
fields = [
'This is Aaron speaking',
'Is Abex a name?',
'Where did Abraham get the mustard from?'
]
regexp = listToRegExp(names)
# get the search result for each field, and link it with the index of the field
results = [[i, regexp.search(field)] for i, field in enumerate(fields)]
# remove non-matches from the results
results = [[i, match.group(0)] for [i, match] in results if match]
# print results
print(results)
See it run on repl.it
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With