I am going through this wonderful tutorial.
I downloaded a collection called book:
>>> import nltk
>>> nltk.download()
and imported texts:
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
I can then run commands on these texts:
>>> text1.concordance("monstrous")
How can I run these NLTK commands on my own dataset? Are these collections the same kind of object as book in Python?
A concordance view shows us every occurrence of a given word, together with some context.
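To make "context" concrete, here is a minimal pure-Python sketch of what a concordance does (the function name `simple_concordance` and the window size are illustrative, not part of NLTK):

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]   # up to `width` tokens before the hit
            right = tokens[i + 1:i + 1 + width]  # up to `width` tokens after the hit
            lines.append(' '.join(left + [tok] + right))
    return lines

tokens = "the whale was a monstrous size and a monstrous sight".split()
for line in simple_concordance(tokens, "monstrous"):
    print(line)
# prints:
# whale was a monstrous size and a
# size and a monstrous sight
```

NLTK's real `concordance` does the same kind of windowed lookup, with nicer alignment and character-based widths.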
We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:
>>> from nltk.corpus import brown
>>> brown.words()
You're right that it's quite hard to find the documentation for the book.py module, so we have to get our hands dirty and look at the code (see here). Looking at book.py, here is how to do the concordance and all the fancy stuff from the book module:
First, you have to put your raw texts into NLTK's corpus class; see Creating a new corpus with NLTK for more details. Second, you read the corpus words into NLTK's Text class. Then you can use the functions that you see in http://nltk.org/book/ch01.html
from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text
# For example, create two example text files
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''
# Creating the corpus directory and writing the files into it
import os
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)
# Read the example corpus into NLTK's corpus class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')
# Read the NLTK corpus into NLTK's Text class,
# where the book-like concordance search is available
mytext = Text(mycorpus.words())
mytext.concordance('foo')
NOTE: you can use other NLTK CorpusReaders and even specify custom paragraph/sentence/word tokenizers and encodings, but for now we'll stick to the defaults.
Text Analysis with NLTK Cheatsheet from blogs.princeton.edu: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf
Working with your own texts:
Open a file for reading
f = open('myfile.txt')
Make sure you are in the correct directory before starting Python - or give the full path specification.
Read the file:
t = f.read()
Tokenize the text:
tokens = nltk.word_tokenize(t)
Convert to NLTK Text object:
text = nltk.Text(tokens)
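Putting the cheatsheet steps together, here is a self-contained sketch (the filename `myfile.txt` and its contents are placeholders; note that nltk.word_tokenize requires the 'punkt' tokenizer data to be downloaded, so this sketch uses a plain split() as a dependency-free stand-in):

```python
import nltk

# Create a small placeholder file so the example is self-contained
with open('myfile.txt', 'w') as f:
    f.write('Foo went to the bar. The bar was closed.')

# Open and read the file
with open('myfile.txt') as f:
    t = f.read()

# Tokenize: the cheatsheet uses nltk.word_tokenize(t), which needs
# the 'punkt' models; split() is a crude stand-in for this sketch
tokens = t.split()

# Convert to an NLTK Text object and search it
text = nltk.Text(tokens)
text.concordance('bar')
```

Once you have an nltk.Text object, all the book-style functions (concordance, similar, collocations, and so on) work on your own data just as they do on text1 through text9.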