
Word frequency in text using Python, disregarding stop words

This gives me a frequency of words in a text:

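    import operator
    import re
    from collections import defaultdict

    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        d[word] += 1
    # sort (word, count) pairs by count, highest first
    finalFreq = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
    self.response.out.write(finalFreq)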

This also gives me useless words like "the", "an", and "a".

Is there a stop-word library available for Python that can remove all these common words? I want to run this on Google App Engine.

demos asked Dec 28 '22 11:12



2 Answers

You can download lists of stop words as files in various formats, e.g. from here. All Python needs to do is read the file (these are in CSV format, easily read with the csv module), build a set, and test membership in that set (probably after some normalization, e.g. lowercasing) to exclude words from the count.
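A minimal sketch of that idea, written for Python 3 and assuming a CSV whose cells each hold one stop word (the real layout depends on the list you download). The names `read_stop_words` and `count_words` and the file name `stopwords.csv` are illustrative, not part of any library.

```python
import csv
import re
from collections import defaultdict


def read_stop_words(lines):
    """Build a lowercase set from an iterable of CSV lines (e.g. an open file)."""
    stop_words = set()
    for row in csv.reader(lines):
        for cell in row:
            word = cell.strip().lower()
            if word:  # ignore empty cells
                stop_words.add(word)
    return stop_words


def count_words(text, stop_words):
    """Count the words in text, skipping anything found in stop_words."""
    counts = defaultdict(int)
    for word in re.findall(r"\w+", text):
        word = word.lower()  # normalize so "The" matches "the"
        if word not in stop_words:
            counts[word] += 1
    return counts


# Hypothetical usage with a downloaded file:
#     with open("stopwords.csv", newline="") as handle:
#         stop_words = read_stop_words(handle)
#     counts = count_words(allText, stop_words)
```

Because `read_stop_words` accepts any iterable of lines, it can be fed an open file or a plain list of strings, which makes it easy to try out without a file on disk.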

Alex Martelli answered Jan 14 '23 13:01



There's an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):

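    import re
    from collections import defaultdict

    stopWords = set(['a', 'an', 'the', ...])  # "..." stands for the rest of your list
    fullWords = re.findall(r'\w+', allText)
    d = defaultdict(int)
    for word in fullWords:
        if word not in stopWords:  # skip stop words while counting
            d[word] += 1
    # sort (word, count) pairs by count, highest first
    finalFreq = sorted(d.items(), key=lambda t: t[1], reverse=True)
    self.response.out.write(finalFreq)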

This approach works in two steps: while counting, it skips any word found in your set of "stop words" (a set rather than a list, so each membership test is fast), and then it sorts the remaining counts in descending order.
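One caveat: the membership test is case-sensitive, so "The" would still be counted unless you normalize first. Here is a hedged variant that lowercases every word and swaps in `collections.Counter` (available from Python 2.7) for the defaultdict and the explicit sort. The function name `word_frequencies` and the tiny `STOP_WORDS` set are illustrative only.

```python
import re
from collections import Counter

# Illustrative only; substitute a fuller stop-word list.
STOP_WORDS = frozenset(["a", "an", "the"])


def word_frequencies(text, stop_words=STOP_WORDS):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    words = (w.lower() for w in re.findall(r"\w+", text))
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common()


print(word_frequencies("The cat saw a cat and an owl"))
```

`most_common()` returns the pairs already sorted by count, so no separate `sorted(...)` call is needed.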

David Z answered Jan 14 '23 14:01
