Given a list ["one", "two", "three"]
, how to determine if each word exist in a specified string?
The word list is pretty short (in my case less than 20 words), but the strings to be searched is pretty huge (400,000 strings for each run)
My current implementation uses re
to look for matches but I'm not sure if it's the best way.
import re
word_list = ["one", "two", "three"]
regex_string = "(?<=\W)(%s)(?=\W)" % "|".join(word_list)
finder = re.compile(regex_string)
string_to_be_searched = "one two three"
results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
if word in result_set:
print("%s in string" % word)
Problems in my solution:
Possible simpler implementation:
if word in string_to_be_searched
. But it can not deal with "threesome" if you are looking for "three"UPDATE:
I've accepted Aaron Hall's answer https://stackoverflow.com/a/21718896/683321 because according to Peter Gibson's benchmark https://stackoverflow.com/a/21742190/683321 this simple version has the best performance. If you are interested in this problem, you can read all the answers and get a better view.
Actually I forgot to mention another constraint in my original problem. The word can be a phrase, for example: word_list = ["one day", "second day"]
. Maybe I should ask another question.
Using any() to check if string contains element from list. Using any function is the most classical way in which you can perform this task and also efficiently. This function checks for match in string with match of each element of list.
Just use all() and check for types with isinstance() .
Use the all() function to check if a list contains only strings, e.g. if all(isinstance(item, str) for item in my_list): . The all() function will return True if the list contains only strings and False otherwise.
To check if string contains substring from a list of strings, iterate over list of strings, and for each item in the list, check if the item is present in the given string. a source string, in which we have to check if some substring is present.
Easy way:
filter(lambda x:x in string,search_list)
if you want the search to ignore character's case you can do this:
lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)
if you want to ignore words that are part of bigger word such as three in threesome:
lower_string=string.lower()
result=[]
if ' ' in lower_string:
result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
substr=lower_string[:lower_string.find(' ')]
if substr in search_list and substr not in result:
result+=[substr]
substr=lower_string[lower_string.rfind(' ')+1:]
if substr in search_list and substr not in result:
result+=[substr]
else:
if lower_string in search_list:
result=[lower_string]
arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))
EDIT: this method was the fastest in an example that searches for 1,000 words in a string containing 400,000 words but if we increased the string to be 4,000,000 the previous method is faster.
def safe_remove(arr,elem):
try:
arr.remove(elem)
except:
pass
not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
safe_remove(not_found,string[i+1:j])
i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])
not_found
list contains words that are not found, you can get the found list easily, one way is list(set(search_list)-set(not_found))
EDIT: the last method appears to be the slowest.
This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):
def words_in_string(word_list, a_string):
return set(word_list).intersection(a_string.split())
Usage:
my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
print('One or more words found!')
Which prints One or words found!
to stdout.
It does return the actual words found:
for word in words_in_string(my_word_list, a_string):
print(word)
Prints out:
three
two
one
For data so large you can't hold it in memory, the solution given in this answer would be very performant.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With