
How do I create my own NLTK text from a text file?




I'm a Literature grad student, and I've been going through the O'Reilly book on Natural Language Processing (nltk.org/book). It looks incredibly useful. I've played around with all the example texts and example tasks in Chapter 1, like concordances. I now know how many times Moby Dick uses the word "whale." The problem is, I can't figure out how to do these calculations on one of my own texts. I've found information on how to create my own corpora (Ch. 2 of the O'Reilly book), but I don't think that's exactly what I want to do. In other words, I want to be able to do

    import nltk
    text1.concordance('yellow')

and get the places where the word 'yellow' is used in my text. At the moment I can do this with the example texts, but not my own.

I'm very new to Python and programming, and so this stuff is very exciting, but very confusing.

Jonathan asked May 06 '12 00:05


2 Answers

Found the answer myself. That's embarrassing. Or awesome.

From Ch. 3:

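    import nltk

    # 'rU' was removed in Python 3.11; plain 'r' already handles universal newlines
    with open('my-file.txt', 'r') as f:
        raw = f.read()
    tokens = nltk.word_tokenize(raw)  # needs NLTK's Punkt tokenizer data (see nltk.download())
    text = nltk.Text(tokens)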

Does the trick.
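To tie this back to the concordance in the question, here is a minimal end-to-end sketch. The sample sentence and the temporary file are assumptions made up for illustration, and I've swapped in `wordpunct_tokenize` (a regex-based tokenizer that ships with NLTK) for `word_tokenize` so the sketch runs without downloading any tokenizer data. It splits punctuation a bit differently, so treat it as a stand-in rather than the book's method.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text, written to a temporary file so the sketch is self-contained.
sample = "The paper was yellow. A smouldering, unclean yellow crept over the wall."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-file.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    # Read the file back, as you would with your own text.
    with open(path, encoding="utf-8") as f:
        raw = f.read()

# wordpunct_tokenize is regex-based, so it needs no downloaded tokenizer data.
tokens = wordpunct_tokenize(raw)
text = nltk.Text(tokens)

text.concordance("yellow")           # prints each match with its surrounding context
yellow_count = text.count("yellow")  # exact, case-sensitive token count
```

With your own file, replace the temporary-file part with a plain `open()` on your path; everything from `tokens = ...` onward stays the same.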

Jonathan answered Oct 06 '22 23:10


For a structured import of multiple files:

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    # RegEx or list of file names
    files = r".*\.txt"

    corpus0 = PlaintextCorpusReader("/path/", files)
    corpus = nltk.Text(corpus0.words())

see: NLTK 3 book / section 1.9
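A small sketch of how the reader picks up files, in case it helps. The file names and contents below are hypothetical, and the temporary directory stands in for your real folder. It shows that the regex selects only the `.txt` files, and that `words()` can cover the whole directory or a single file.

```python
import os
import tempfile

import nltk
from nltk.corpus import PlaintextCorpusReader

# Hypothetical file names and contents, used only for illustration.
samples = {
    "chapter1.txt": "The paper was yellow.",
    "chapter2.txt": "A yellow smell crept through the house.",
    "notes.md": "This file is skipped because it does not end in .txt.",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(content)

    reader = PlaintextCorpusReader(root, r".*\.txt")
    matched_files = reader.fileids()                      # only the .txt files
    # The reader loads lazily, so copy the words out before the directory is removed.
    all_words = list(reader.words())                      # every matched file
    chapter2_words = list(reader.words("chapter2.txt"))   # a single file

text = nltk.Text(all_words)
text.concordance("yellow")
```

Passing a file name to `words()` is handy when you want a separate `nltk.Text` per document instead of one combined text.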

Raffael answered Oct 06 '22 23:10
