Python: Check if string and its substring are existing in the same list

Tags: python, nlp

I've extracted keywords by matching the 1-grams, 2-grams, and 3-grams of each tokenized sentence against a vocabulary list:

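from nltk.util import everygrams

vocabulary = set(New_vocabulary_list)  # build the lookup set once, not once per n-gram
list_of_keywords = []
for sentences in stemmed_words:
    temp = []
    for tokens in sentences:
        # Join every 1- to 3-gram into a string and keep those found in the vocabulary.
        ngrams = (' '.join(gram) for gram in everygrams(tokens, 1, 3))
        temp.append([ngram for ngram in ngrams if ngram in vocabulary])
    list_of_keywords.append(temp)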

This gives me keyword lists such as:

['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
['sleep', 'anxiety', 'lack of sleep']

How can I simplify the results by removing every string that is a substring of another string in the same list, so that only the following remain?

['high blood pressure']
['anxiety', 'lack of sleep']
Lisa asked Mar 15 '19 09:03


2 Answers

You could use this one-liner:

b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [i for i in b if not any(i in a for a in b if a != i)]
# ['high blood pressure']

I admit this is O(n^2), so it may be slow for large inputs.

The list comprehension is basically shorthand for the following loop:

word_list = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

result = []
for this_word in word_list:
    # Compare against every other entry, skipping the word itself.
    words_without_this_word = [other_word for other_word in word_list if other_word != this_word]
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            found = True
            break  # one containing string is enough

    if not found:
        result.append(this_word)

print(result)  # ['high blood pressure']
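
To apply the same filter to each keyword list from the question, the comprehension can be wrapped in a small function. This is only a minimal sketch: the name remove_substrings is made up for illustration, and it uses plain substring matching, so 'art' would also be dropped if 'heart' were in the same list.

```python
def remove_substrings(words):
    """Keep only the strings that are not contained in another string of the list.

    Hypothetical helper name; same O(n^2) logic as the one-liner above.
    """
    return [w for w in words if not any(w in other for other in words if other != w)]


keyword_lists = [
    ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'],
    ['sleep', 'anxiety', 'lack of sleep'],
]
simplified = [remove_substrings(keywords) for keywords in keyword_lists]
print(simplified)
# [['high blood pressure'], ['anxiety', 'lack of sleep']]
```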
Christian Sloper answered Oct 19 '22 21:10


If you have a large list of words, it might be a good idea to use a suffix tree.

The suffix-trees package on PyPI provides one.

Once you've created the tree, you can call find_all(word) to get the index of every occurrence of word. You then simply keep the strings that appear only once:

from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)

st.find_all('blood')
# [0, 20, 26, 46]

st.find_all('high blood pressure')
# [41]

[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']

words must not contain duplicates, so you might need to call list(set(words)) before building the suffix tree.
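
Note that list(set(words)) does not preserve the original order of the list. If the order matters, dict.fromkeys is one way to drop duplicates while keeping the first occurrence of each string. A small sketch using only the standard library (dict ordering is guaranteed from Python 3.7 on; the sample list is made up):

```python
words = ['sleep', 'anxiety', 'sleep', 'lack of sleep', 'anxiety']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without reordering the list.
unique_words = list(dict.fromkeys(words))
print(unique_words)
# ['sleep', 'anxiety', 'lack of sleep']
```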

As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.
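
If adding a dependency is not an option, a standard-library variant can at least cut down the number of comparisons: process the strings from longest to shortest and compare each one only against the strings already kept. This is a sketch under assumptions and not part of either answer above: the function name is hypothetical, the worst case is still quadratic, and duplicates are collapsed to a single entry.

```python
def keep_maximal_strings(words):
    """Return the strings that are not substrings of any other string in `words`.

    Hypothetical helper. Any string contained in another one is also
    contained in some kept (maximal) string, so it is enough to compare
    against the kept strings only.
    """
    kept = []
    # Longest first: a string can only be contained in a string at least as long.
    for word in sorted(set(words), key=len, reverse=True):
        if not any(word in longer for longer in kept):
            kept.append(word)
    # Restore the order in which the survivors appeared in the input.
    kept_set = set(kept)
    return [w for w in dict.fromkeys(words) if w in kept_set]


words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure',
         'sleep', 'anxiety', 'lack of sleep']
print(keep_maximal_strings(words))
# ['high blood pressure', 'anxiety', 'lack of sleep']
```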

Eric Duminil answered Oct 19 '22 22:10