Python: Check whether a string and its substring both exist in the same list

Tags: python, nlp

I've extracted keywords based on the 1-grams, 2-grams, and 3-grams within a tokenized sentence:

list_of_keywords = []
for i in range(0, len(stemmed_words)):
    temp = []
    for j in range(0, len(stemmed_words[i])):
        temp.append([' '.join(x) for x in list(everygrams(stemmed_words[i][j], 1, 3)) if ' '.join(x) in set(New_vocabulary_list)])
    list_of_keywords.append(temp)

I've obtained the following keyword lists:

['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
['sleep', 'anxiety', 'lack of sleep']

How can I simplify the results by removing every string that is a substring of another string in the same list, so that only these remain:

['high blood pressure']
['anxiety', 'lack of sleep']

asked Mar 15 '19 09:03

Lisa

2 Answers

You could use this one-liner:

b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [i for i in b if not any(i in a for a in b if a != i)]

I admit this is O(n²) in the number of words, so it may be slow for large inputs.

The one-liner is basically the list-comprehension form of the following loop:

word_list = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']

result = []
for this_word in word_list:
    # Every other word in the list, excluding this one
    words_without_this_word = [other_word for other_word in word_list if other_word != this_word]
    found = False
    for other_word in words_without_this_word:
        if this_word in other_word:
            # this_word is a substring of another entry, so it gets dropped
            found = True
            break

    if not found:
        result.append(this_word)

print(result)
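If the same filtering has to run over many keyword lists, it may be handy to wrap it in a function. The following is only a sketch, not something from the question: the name remove_contained and the whole_words flag are made up for illustration. It checks the longest keywords first, so each keyword is only compared against the longer ones that were kept (still O(n²) in the worst case), and with whole_words=True it assumes the keywords are tokens joined by single spaces, so that 'blood' is not treated as part of 'bloodstream'.

```python
def remove_contained(keywords, whole_words=True):
    """Return the keywords that are not contained in any other keyword.

    Hypothetical helper: with whole_words=True the containment test is done
    on space-delimited tokens; with whole_words=False it is a plain
    substring test, as in the one-liner.
    """
    unique = list(dict.fromkeys(keywords))  # drop duplicates, keep order
    kept = []
    # Longest first: a keyword can only be contained in a longer one,
    # and anything contained in a dropped keyword is also contained
    # in the kept keyword that caused the drop.
    for word in sorted(unique, key=len, reverse=True):
        if whole_words:
            contained = any(f" {word} " in f" {other} " for other in kept)
        else:
            contained = any(word in other for other in kept)
        if not contained:
            kept.append(word)
    kept_set = set(kept)
    # Report the survivors in their original order of appearance
    return [word for word in unique if word in kept_set]


first = remove_contained(['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'])
second = remove_contained(['sleep', 'anxiety', 'lack of sleep'])
print(first)   # ['high blood pressure']
print(second)  # ['anxiety', 'lack of sleep']
```

Whether token-level or plain substring matching is the right choice depends on how the n-grams were built; for the lists in the question both give the same result.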

answered Oct 19 '22 21:10

Christian Sloper

If you have a large list of words, it might be a good idea to use a suffix tree.

Here's a package on PyPI: suffix-trees (https://pypi.org/project/suffix-trees/).

Once you have created the tree, you can call find_all(word) to get the index of every occurrence of word. You then simply keep the strings that appear only once:

from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees

words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)

st.find_all('blood')
# [0, 20, 26, 46]

st.find_all('high blood pressure')
# [41]

[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']

words needs to be a list of unique strings, so you might need to call list(set(words)) before generating the suffix tree.
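One thing to keep in mind is that a set does not preserve the order of the keywords. If the order matters, a possible alternative (assuming Python 3.7 or later, where dicts keep insertion order) is dict.fromkeys:

```python
words = ['sleep', 'anxiety', 'sleep', 'lack of sleep']

# dict keys are unique and keep insertion order, so the first
# occurrence of each keyword stays in place
unique_words = list(dict.fromkeys(words))
print(unique_words)  # ['sleep', 'anxiety', 'lack of sleep']
```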

As far as I can tell, the whole script should run in O(n), with n being the total length of the strings.
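To see the "keep the strings which only appear once" idea without installing anything, here is a rough standard-library sketch. It is not the suffix-tree approach and it is not linear: it joins the unique words with a separator and scans the joined text once per word with str.count. The function name keep_unique_occurrences and the choice of "\n" as separator are assumptions made for this example.

```python
def keep_unique_occurrences(words, separator="\n"):
    """Keep the words that occur exactly once in the joined text.

    Illustration only: each word is searched for in the whole text,
    so this does not have the suffix tree's complexity.
    """
    unique = list(dict.fromkeys(words))  # each word must be listed once
    if any(separator in word for word in unique):
        raise ValueError("separator must not appear inside a word")
    joined = separator.join(unique)
    # A word always matches its own entry; a second match means it is
    # contained in some other entry.
    return [word for word in unique if joined.count(word) == 1]


words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
survivors = keep_unique_occurrences(words)
print(survivors)  # ['high blood pressure', 'anxiety', 'lack of sleep']
```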

answered Oct 19 '22 22:10

Eric Duminil
