I've extracted keywords by taking every 1-gram, 2-gram, and 3-gram within each tokenized sentence:
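from nltk.util import everygrams

vocabulary = set(New_vocabulary_list)  # build the lookup set once, not per n-gram
list_of_keywords = []
for sentences in stemmed_words:
    temp = []
    for tokens in sentences:
        # keep every 1- to 3-gram of the tokenized sentence that is in the vocabulary
        ngrams = (' '.join(gram) for gram in everygrams(tokens, 1, 3))
        temp.append([ngram for ngram in ngrams if ngram in vocabulary])
    list_of_keywords.append(temp)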
I obtained keyword lists such as:
['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
['sleep', 'anxiety', 'lack of sleep']
How can I simplify the results by removing every keyword that is a substring of another keyword in the same list, so that only the following remain:
['high blood pressure']
['anxiety', 'lack of sleep']
Use the any() function to check whether any string in a list contains a given substring. any(iterable) takes an iterable, typically a generator expression that tests each element for the substring, and returns True if at least one test succeeds and False otherwise.
Compared with approaches that first join the list into one concatenated string and search that, any() needs no extra space for the joined string.
To check whether a list contains an element, use the in operator, which tests whether a specific item is in the list. Note that in on a list compares whole elements; it does not look inside each string for a substring.
You can also use the built-in list method count(), which returns the number of times the element occurs in the list. A non-zero count means the element exists in the list.
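As a small illustration of the difference between these checks, here is a sketch using the keyword list from the question. The variable names are made up for the example.

```python
keywords = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

# Substring test: does any string in the list contain 'high'?
has_high = any('high' in kw for kw in keywords)

# Membership test: `in` on a list compares whole elements, not substrings.
is_member = 'blood' in keywords
is_partial_member = 'high' in keywords

# count() returns the number of elements equal to the argument.
blood_count = keywords.count('blood')

print(has_high, is_member, is_partial_member, blood_count)
# True True False 1
```

So 'high' is found as a substring by any(), but it is not an element of the list.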
You could use this one-liner:
b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [i for i in b if not any(i in a for a in b if a != i)]
I admit this is O(n^2) in the number of keywords and may be slow for large inputs.
The one-liner is basically a list-comprehension version of the following loop:
word_list = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = []
for this_word in word_list:
    words_without_this_word = [other_word for other_word in word_list if other_word != this_word]
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            found = True
    if not found:
        result.append(this_word)
result
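To reuse this on several keyword lists, the same logic can be wrapped in a function. This is only a sketch: the name remove_substrings is hypothetical, and it still does character-level substring matching with quadratic cost.

```python
def remove_substrings(keywords):
    """Keep only keywords that are not contained in another, different keyword."""
    return [
        kw for kw in keywords
        if not any(kw in other for other in keywords if other != kw)
    ]


first = remove_substrings(
    ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
)
second = remove_substrings(['sleep', 'anxiety', 'lack of sleep'])

print(first)   # ['high blood pressure']
print(second)  # ['anxiety', 'lack of sleep']
```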
If you have a large list of words, it might be a good idea to use a suffix tree.
There is a package for this on PyPI (suffix-trees).
Once you have created the tree, you can call find_all(word) to get the index of every occurrence of word. You then simply keep the strings that appear only once:
from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees
words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)
st.find_all('blood')
# [0, 20, 26, 46]
st.find_all('high blood pressure')
# [41]
[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']
words needs to be a unique list of strings, so you might need to call list(set(words)) before generating the suffix tree.
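One caveat with list(set(words)) is that a set does not keep the original order. If order matters, a possible alternative (an addition here, not part of the original answer) is dict.fromkeys, which relies on dicts preserving insertion order in Python 3.7+:

```python
words = ['sleep', 'anxiety', 'sleep', 'lack of sleep', 'anxiety']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats without reordering the list.
unique_words = list(dict.fromkeys(words))

print(unique_words)
# ['sleep', 'anxiety', 'lack of sleep']
```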
As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.