Given these three list comprehensions, is there a more efficient way to do this than three separate passes over the list? I believe that for loops would probably be bad form here, but if I were to iterate over a large number of lines in rowsaslist, I feel like what I have below is not that efficient.
cachedStopWords = stopwords.words('english')
rowsaslist = [x.lower() for x in rowsaslist]
rowsaslist = [''.join(c for c in s if c not in string.punctuation) for s in rowsaslist]
rowsaslist = [' '.join([word for word in p.split() if word not in cachedStopWords]) for p in rowsaslist]
Is combining these all into one comprehension statement more efficient? I know from a readability standpoint it would probably be a mess of code.
Instead of iterating 3 times on the same list, you could simply define 2 functions and use them in one single list comprehension:
import string
from nltk.corpus import stopwords
cachedStopWords = stopwords.words('english')
def remove_punctuation(text):
    # Lowercase the text, then drop every punctuation character.
    return ''.join(c for c in text.lower() if c not in string.punctuation)
def remove_stop_words(text):
    # Keep only the words that are not stop words.
    return ' '.join([word for word in text.split() if word not in cachedStopWords])
rowsaslist = [remove_stop_words(remove_punctuation(text)) for text in rowsaslist]
I've never used stopwords. If it returns a list, you'd better convert it to a set first to speed up the word not in cachedStopWords test: a membership check is O(1) on average for a set, but O(n) for a list.
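As a rough sketch of the set conversion, the snippet below builds the set once and reuses it inside the single comprehension. The names stop_word_list, stop_word_set and rows are hypothetical, and the short word list is only a stand-in for stopwords.words('english') so the example runs without NLTK data.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); assumed here so the
# sketch runs without NLTK installed.
stop_word_list = ["i", "the", "a", "is", "on", "and"]

# Build the set once, outside the comprehension, so every membership
# test is a hash lookup instead of a scan through the list.
stop_word_set = set(stop_word_list)

def remove_punctuation(text):
    # Lowercase the text, then drop every punctuation character.
    return ''.join(c for c in text.lower() if c not in string.punctuation)

def remove_stop_words(text):
    # Keep only the words that are not in the stop-word set.
    return ' '.join(word for word in text.split() if word not in stop_word_set)

rows = ["The cat is on the mat.", "A dog, and a bone!"]
cleaned = [remove_stop_words(remove_punctuation(text)) for text in rows]
print(cleaned)  # ['cat mat', 'dog bone']
```

With the real data, the only change should be replacing the stand-in list with cachedStopWords, i.e. set(cachedStopWords).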
Finally, the NLTK package might help you process text. See @alvas' answer.