Python: how to determine if a list of words exist in a string

Tags:

python

regex

Given a list ["one", "two", "three"], how can I determine whether each word exists in a given string?

The word list is pretty short (in my case fewer than 20 words), but the number of strings to be searched is pretty huge (400,000 strings for each run).

My current implementation uses re to look for matches, but I'm not sure if it's the best way.

import re
word_list = ["one", "two", "three"]
regex_string = r"(?<=\W)(%s)(?=\W)" % "|".join(word_list)

finder = re.compile(regex_string)
string_to_be_searched = "one two three"

results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)

Problems in my solution:

1. It searches to the end of the string, even though all the words may already have appeared in the first half of the string.
2. To work around a limitation of the lookbehind assertion (I don't know how to express "the character before the current match should be a non-word character, or the start of the string"), I added an extra space before and after the string to be searched.
3. Are there other performance issues introduced by the lookahead assertion?
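
On the padding workaround above, one possible alternative is to use negative lookarounds: (?<!\w) and (?!\w) also succeed at the very start and end of the string, so no extra spaces are needed. This is only a sketch; the helper names build_finder and words_found are made up for illustration and are not part of the original code.

```python
import re

def build_finder(word_list):
    # (?<!\w) succeeds at the start of the string or after a non-word character;
    # (?!\w) succeeds at the end of the string or before a non-word character.
    alternatives = "|".join(re.escape(word) for word in word_list)
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives)

def words_found(finder, text):
    # Collect every distinct word that occurs as a whole word in text.
    return set(finder.findall(text))

finder = build_finder(["one", "two", "three"])
print(words_found(finder, "one two threesome"))
```

With this input the result contains "one" and "two" but not "three", because "threesome" is followed by a word character. re.escape is there in case a word contains regex metacharacters.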

Possible simpler implementation:

1. Just loop through the word list and check if word in string_to_be_searched. But this can't deal with "threesome" when you are looking for "three".
2. Use one regular expression search per word. I'm still not sure about the performance, or about the cost of searching the string multiple times.

UPDATE:

I've accepted Aaron Hall's answer https://stackoverflow.com/a/21718896/683321 because according to Peter Gibson's benchmark https://stackoverflow.com/a/21742190/683321 this simple version has the best performance. If you are interested in this problem, you can read all the answers and get a better view.

Actually I forgot to mention another constraint in my original problem. The word can be a phrase, for example: word_list = ["one day", "second day"]. Maybe I should ask another question.
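
For the phrase case, splitting on whitespace no longer works, because "one day" never appears as a single token. A rough sketch of one way to handle it with a single compiled pattern follows. It also stops scanning as soon as every phrase has been seen. The function name phrases_in_string is hypothetical, and the sketch assumes phrases are separated by exactly the same whitespace in the text as in the list.

```python
import re

def phrases_in_string(phrases, text):
    # Try longer alternatives first, so "one day" is preferred over a
    # shorter entry such as "one" starting at the same position.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:%s)(?!\w)" % "|".join(re.escape(p) for p in ordered)
    )
    remaining = set(phrases)
    found = set()
    for match in pattern.finditer(text):
        found.add(match.group(0))
        remaining.discard(match.group(0))
        if not remaining:
            break  # everything has been seen; skip the rest of the text
    return found

print(phrases_in_string(["one day", "second day"],
                        "on the second day, one day later"))
```

One caveat: finditer returns non-overlapping matches, so if one list entry can occur inside another (for example "day" inside "one day"), the inner one may go unreported.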

asked Feb 12 '14 04:02

yegle

2 Answers

Easy way:

list(filter(lambda x: x in string, search_list))

If you want the search to ignore case, you can do this:

lower_string = string.lower()
list(filter(lambda x: x.lower() in lower_string, search_list))

If you want to ignore words that are part of a bigger word, such as three in threesome:

lower_string = string.lower()
result = []
if ' ' in lower_string:
    # list() so the result can be tested with "in" and extended (needed on Python 3)
    result = list(filter(lambda x: ' ' + x.lower() + ' ' in lower_string, search_list))
    # the first and last words have no space on one side, so check them separately
    substr = lower_string[:lower_string.find(' ')]
    if substr in search_list and substr not in result:
        result += [substr]
    substr = lower_string[lower_string.rfind(' ') + 1:]
    if substr in search_list and substr not in result:
        result += [substr]
else:
    if lower_string in search_list:
        result = [lower_string]

If performance is needed:

arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))

EDIT: this method was the fastest in a test that searched for 1,000 words in a string containing 400,000 words, but when the string was increased to 4,000,000 words the previous method was faster.

If the string is too long, you should do a low-level search and avoid converting it to a list:

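def safe_remove(arr, elem):
    # remove elem from arr if it is there; do nothing otherwise
    try:
        arr.remove(elem)
    except ValueError:
        pass

not_found = search_list[:]
i = -1                  # index of the previous space (-1 before the first word)
j = string.find(' ')
while j != -1:
    safe_remove(not_found, string[i+1:j])
    i, j = j, string.find(' ', j+1)
safe_remove(not_found, string[i+1:])  # last word (or the whole string if it has no spaces)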

The not_found list contains the words that were not found. You can get the found list easily; one way is list(set(search_list)-set(not_found))

EDIT: the last method appears to be the slowest.
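
The timings in the EDIT notes depend heavily on the input, so it may be worth measuring on your own data. Below is a rough harness using timeit from the standard library. The synthetic input and the two function names are assumptions made for illustration; by_padded_substring is a simplified stand-in for the space-delimited substring check above, not the exact code.

```python
import timeit

search_list = ["one", "two", "three"]
# Synthetic input: 400,000 space-separated words (made-up data).
string = " ".join(["zero", "one", "four", "three"] * 100000)

def by_set_intersection():
    # Split once, then intersect the set of tokens with the search list.
    return set(string.split(' ')).intersection(search_list)

def by_padded_substring():
    # Pad the string so the first and last words are also space-delimited.
    padded = ' ' + string + ' '
    return {w for w in search_list if ' ' + w + ' ' in padded}

for func in (by_set_intersection, by_padded_substring):
    seconds = timeit.timeit(func, number=5)
    print("%s: %.3f s for 5 runs" % (func.__name__, seconds))
```

Both functions should return the same set for the same input, so only the timing differs.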

answered Sep 18 '22 08:09

MIE

This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

Which prints One or more words found! to stdout.

It does return the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

Prints out (in arbitrary order, since a set is unordered):

three
two
one
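
Note that a_string.split() only splits on whitespace, so a token like "three," would not match "three". If the input may contain punctuation or mixed case, a variant along these lines could help. It is a sketch, and the name words_in_text is made up here.

```python
import re

def words_in_text(word_list, text):
    # \w+ pulls out runs of letters, digits and underscores, so "three,"
    # and "three." both yield the token "three".
    tokens = set(re.findall(r"\w+", text.lower()))
    return {word for word in word_list if word.lower() in tokens}

print(words_in_text(["one", "two", "three"], "One, two... threesome!"))
```

Here "One," and "two..." are recognised, while "threesome" still does not count as "three".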

For data so large you can't hold it in memory, the solution given in this answer would be very performant.
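
One way to avoid holding everything in memory, not necessarily the approach referred to above, is to read the input line by line and stop as soon as every word has been seen. This is a minimal sketch; words_in_stream is a hypothetical name, and io.StringIO stands in for a real file.

```python
import io

def words_in_stream(word_list, lines):
    # `lines` is any iterable of strings, e.g. an open text file.
    remaining = set(word_list)
    found = set()
    for line in lines:
        hits = remaining.intersection(line.split())
        found |= hits
        remaining -= hits
        if not remaining:
            break  # stop reading once every word has been seen
    return found

# io.StringIO stands in for a large file opened with open(path).
sample = io.StringIO("one fish\ntwo fish\nred fish\n")
print(words_in_stream(["one", "two", "three"], sample))
```

Only one line is held in memory at a time, and the early break means the rest of the input is never read once all words are found.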

answered Sep 18 '22 08:09

Russia Must Remove Putin
