Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: how to determine if a list of words exist in a string

Tags:

python

regex

Given a list ["one", "two", "three"], how to determine if each word exist in a specified string?

The word list is pretty short (in my case less than 20 words), but the strings to be searched is pretty huge (400,000 strings for each run)

My current implementation uses re to look for matches but I'm not sure if it's the best way.

import re
word_list = ["one", "two", "three"]
regex_string = "(?<=\W)(%s)(?=\W)" % "|".join(word_list)

finder = re.compile(regex_string)
string_to_be_searched = "one two three"

results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)

Problems in my solution:

  1. It will search until the end of the string, although the words may appear in the first half of the string
  2. In order to overcome the limitation of lookahead assertion (I don't know how to express "the character before current match should be non-word characters, or the start of the string"), I added extra space before and after the string I need to be searched.
  3. Other performance issue introduced by the lookahead assertion?

Possible simpler implementation:

  1. just loop through the word list and do a if word in string_to_be_searched. But it can not deal with "threesome" if you are looking for "three"
  2. Use one regular expression search for one word. Still I'm not sure about the performance, and the potential of searching string multiple times.

UPDATE:

I've accepted Aaron Hall's answer https://stackoverflow.com/a/21718896/683321 because according to Peter Gibson's benchmark https://stackoverflow.com/a/21742190/683321 this simple version has the best performance. If you are interested in this problem, you can read all the answers and get a better view.

Actually I forgot to mention another constraint in my original problem. The word can be a phrase, for example: word_list = ["one day", "second day"]. Maybe I should ask another question.

like image 381
yegle Avatar asked Feb 12 '14 04:02

yegle


People also ask

How do you see if a list of words is in a string Python?

Using any() to check if string contains element from list. Using any function is the most classical way in which you can perform this task and also efficiently. This function checks for match in string with match of each element of list.

How do you check if all items in a list are in a string Python?

Just use all() and check for types with isinstance() .

How do you check if all items in a list are in a string?

Use the all() function to check if a list contains only strings, e.g. if all(isinstance(item, str) for item in my_list): . The all() function will return True if the list contains only strings and False otherwise.

How do you check if a list of string contains a string?

To check if string contains substring from a list of strings, iterate over list of strings, and for each item in the list, check if the item is present in the given string. a source string, in which we have to check if some substring is present.


2 Answers

Easy way:

filter(lambda x:x in string,search_list)

if you want the search to ignore character's case you can do this:

lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)

if you want to ignore words that are part of bigger word such as three in threesome:

lower_string=string.lower()
result=[]
if ' ' in lower_string:
    result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
    substr=lower_string[:lower_string.find(' ')]
    if substr in search_list and substr not in result:
        result+=[substr]
    substr=lower_string[lower_string.rfind(' ')+1:]
    if substr in search_list and substr not in result:
        result+=[substr]
else:
    if lower_string in search_list:
        result=[lower_string]


If performance is needed:
arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))

EDIT: this method was the fastest in an example that searches for 1,000 words in a string containing 400,000 words but if we increased the string to be 4,000,000 the previous method is faster.


if string is too long you should do low level search and avoid converting it to list:
def safe_remove(arr,elem):
    try:
        arr.remove(elem)
    except:
        pass

not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
    safe_remove(not_found,string[i+1:j])
    i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])

not_found list contains words that are not found, you can get the found list easily, one way is list(set(search_list)-set(not_found))

EDIT: the last method appears to be the slowest.

like image 153
MIE Avatar answered Sep 18 '22 08:09

MIE


This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

Which prints One or words found! to stdout.

It does return the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

Prints out:

three
two
one

For data so large you can't hold it in memory, the solution given in this answer would be very performant.

like image 20
Russia Must Remove Putin Avatar answered Sep 18 '22 08:09

Russia Must Remove Putin