I'm a Literature grad student, and I've been going through the O'Reilly book on Natural Language Processing (nltk.org/book). It looks incredibly useful. I've played around with all the example texts and example tasks in Chapter 1, like concordances. I now know how many times Moby Dick uses the word "whale." The problem is, I can't figure out how to do these calculations on one of my own texts. I've found information on how to create my own corpora (Ch. 2 of the O'Reilly book), but I don't think that's exactly what I want to do. In other words, I want to be able to do
import nltk
text1.concordance('yellow')
and get the places where the word 'yellow' is used in my text. At the moment I can do this with the example texts, but not my own.
I'm very new to Python and programming, so this stuff is very exciting but also very confusing.
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first make sure you have Python-callable word tokenization and sentence tokenization modules that take a string as input and produce a list of tokens (or sentences) as output.
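For instance, PlaintextCorpusReader accepts word_tokenizer and sent_tokenizer arguments. Here is a minimal sketch, assuming a hypothetical directory /path/to/spanish_texts/ of Spanish .txt files and that the pretrained Punkt models have been fetched with nltk.download('punkt'):

import nltk.data
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Simple language-agnostic word tokenizer: runs of word characters or punctuation.
word_tok = RegexpTokenizer(r'\w+|[^\w\s]+')

# Pretrained Punkt sentence tokenizer for Spanish (requires nltk.download('punkt')).
sent_tok = nltk.data.load('tokenizers/punkt/spanish.pickle')

corpus0 = PlaintextCorpusReader(
    '/path/to/spanish_texts/', r'.*\.txt',
    word_tokenizer=word_tok,
    sent_tokenizer=sent_tok,
)

print(corpus0.words()[:20])  # first 20 word tokens across the corpus
print(corpus0.sents()[:2])   # first 2 sentences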
Found the answer myself. That's embarrassing. Or awesome.
From Ch. 3:
import nltk

f = open('my-file.txt', 'r')  # the book uses mode 'rU', which is deprecated in Python 3
raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
Does the trick.
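With text built this way, the concordance call from the question works on your own file:

text.concordance('yellow')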
For a structured import of multiple files:
import nltk
from nltk.corpus import PlaintextCorpusReader

# RegEx or list of file names
files = r".*\.txt"
corpus0 = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(corpus0.words())
see: NLTK 3 book / section 1.9
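Since corpus here is again an nltk.Text, the same Chapter 1 operations now run across all the matched files at once, e.g.:

corpus.concordance('yellow')
print(corpus0.fileids())  # which files the pattern matched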