
FreqDist using NLTK

I'm trying to get a frequency distribution of a set of documents using Python. My code isn't working for some reason and is producing this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help?

This is the code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read() 
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)
Asked by AJS, Jan 01 '26 12:01


1 Answer

Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text class states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I'm not sure Text() is the way you want to handle this data. It seems to me you would do just fine to use a list.
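
A minimal sketch of the plain-list approach, assuming Python 3 and NLTK 3 (where FreqDist is a Counter subclass). The traceback in the question is consistent with corpus being a list of lists: corpus.append(input) adds each document's word list as a single element, and FreqDist can only count hashable samples such as strings. Using extend() keeps the corpus flat. The flatten_documents helper and the sample documents are hypothetical, standing in for the files read from the directory.

```python
from collections import Counter

try:
    from nltk.probability import FreqDist  # a Counter subclass in NLTK 3
except ImportError:
    FreqDist = Counter  # stand-in so the sketch still runs without NLTK


def flatten_documents(documents):
    """Return one flat list of tokens from an iterable of document strings."""
    tokens = []
    for text in documents:
        # extend() adds each word individually;
        # append() would add the whole word list as one unhashable item.
        tokens.extend(text.split())
    return tokens


# Hypothetical in-memory documents standing in for the files on disk.
documents = ["the cat sat", "the dog sat down"]
tokens = flatten_documents(documents)

# A flat list of strings can be counted directly, with no nltk.Text wrapper.
fd = FreqDist(tokens)
print(fd["the"], fd["sat"], fd["dog"])
```

The same idea applies to the loop in the question: replacing corpus.append(input) with corpus.extend(input) and passing corpus straight to FreqDist should avoid the TypeError.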

Second, I would caution you to think about the calculation you're asking NLTK to perform here. Removing stopwords before building the frequency distribution skews the relative frequencies, because the stopwords no longer count toward the total. I do not understand why the stopwords are removed before tabulation rather than simply ignored when examining the distribution afterward. (I suppose this second point would make a better comment than part of an answer, but I felt it worth pointing out that the proportions would be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem.
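
To make the skew concrete, here is a small sketch using collections.Counter as a stand-in for FreqDist (an assumption made so it runs without NLTK). The token list, the stopword set, and the relative_frequency helper are all made up for illustration.

```python
from collections import Counter


def relative_frequency(counts, word):
    """Share of all counted tokens that are `word` (0.0 if nothing was counted)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical tokens and stopwords, for illustration only.
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
stopwords = {"the", "on"}

# Option 1: count everything, then ignore stopwords when inspecting results.
all_counts = Counter(tokens)
visible = {w: c for w, c in all_counts.items() if w not in stopwords}

# Option 2: drop stopwords first, then count.
# Building a new sequence also avoids removing items from a list while iterating over it.
content_counts = Counter(t for t in tokens if t not in stopwords)

print(relative_frequency(all_counts, "cat"))      # share among all 8 tokens
print(relative_frequency(content_counts, "cat"))  # share among the 4 non-stopword tokens
```

The raw counts are identical either way (visible and content_counts hold the same numbers); only the proportions differ, since the denominator shrinks once stopwords are removed.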

Answered by dmh, Jan 03 '26 02:01


