
Python NLTK Shakespeare corpus

Tags: python, nlp, nltk

I am trying to import sentences from the Shakespeare corpus that ships with NLTK, following this help site, but I am having trouble getting access to the sentences (which I need in order to train a word2vec model):

from nltk.corpus import shakespeare  # an XMLCorpusReader
shakespeare.fileids()
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', ...]

play = shakespeare.xml('dream.xml')  # root Element of the parsed XML
print(play)
<Element 'PLAY' at ...>

for i in range(9):
    print('%s: %s' % (play[i].tag, play[i].text))

This returns the following:

TITLE: A Midsummer Night's Dream
PERSONAE: 

SCNDESCR: SCENE  Athens, and a wood near it.
PLAYSUBT: A MIDSUMMER NIGHT'S DREAM
ACT: None
ACT: None
ACT: None
ACT: None
ACT: None

Why are all the acts None?

None of the data-access methods described here (http://www.nltk.org/howto/corpus.html#data-access-methods), such as .sents(), .tagged_sents(), .chunked_sents() and .parsed_sents(), seem to work when applied to the shakespeare XMLCorpusReader.

I'd like to understand:
1/ how to get the sentences
2/ how to find out where they are located in an ElementTree object

Romain G asked Oct 18 '22 13:10



1 Answer

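The ACT elements print None because .text only holds the text that sits directly inside an element, before its first child. An ACT contains only child elements, so the actual words live further down the tree. The question therefore boils down to how to extract the text from all descendants of an element, which makes it close to a duplicate of "Python element tree - extract text from element, stripping tags".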
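As an illustration of how .text and itertext() differ on a nested element, here is a minimal sketch that uses only the standard library. The XML string is a hypothetical miniature play, not a corpus file; it is assumed to mirror a PLAY > ACT > SCENE > SPEECH > LINE nesting, and the names SAMPLE and demo are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play (an assumption, not copied from the corpus files).
SAMPLE = (
    "<PLAY>"
    "<TITLE>A Midsummer Night's Dream</TITLE>"
    "<ACT><TITLE>ACT I</TITLE>"
    "<SCENE><TITLE>SCENE I</TITLE>"
    "<SPEECH><SPEAKER>THESEUS</SPEAKER>"
    "<LINE>Now, fair Hippolyta, our nuptial hour</LINE>"
    "</SPEECH></SCENE></ACT>"
    "</PLAY>"
)

demo = ET.fromstring(SAMPLE)
act = demo.find("ACT")

# .text is only the text directly inside <ACT>, before its first child element.
direct_text = act.text

# itertext() walks the whole subtree and yields every text fragment in order.
nested_text = list(act.itertext())

print(direct_text)
print(nested_text)
```

Here direct_text is None, while nested_text contains the act title, the scene title, the speaker and the line.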

Try this:

for p in play:
    print('%s: %s' % (p.tag, list(p.itertext())))

Replace the print call with whatever processing you need.
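For the two original questions, one possible approach is sketched below: count the tags in the tree to see where the text lives, then collect the text of each line element as a list of tokens. This is only a sketch under assumptions. The helper names tag_counts and lines_as_tokens are hypothetical, the LINE tag is assumed to be what the corpus files use for spoken lines, the regex tokenizer is a simple stand-in for a real one, and the inline XML is a made-up sample so the code runs without the corpus installed.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature play (an assumption, not copied from the corpus files).
SAMPLE = (
    "<PLAY>"
    "<TITLE>A Midsummer Night's Dream</TITLE>"
    "<ACT><TITLE>ACT I</TITLE>"
    "<SCENE><TITLE>SCENE I</TITLE>"
    "<SPEECH><SPEAKER>THESEUS</SPEAKER>"
    "<LINE>Now, fair Hippolyta, our nuptial hour</LINE>"
    "<LINE>Draws on apace; four happy days bring in</LINE>"
    "</SPEECH></SCENE></ACT>"
    "</PLAY>"
)


def tag_counts(root):
    """Count how often each tag occurs anywhere below (and including) root."""
    return Counter(el.tag for el in root.iter())


def lines_as_tokens(root, tag="LINE"):
    """Return one lowercase token list per element with the given tag."""
    token_lists = []
    for el in root.iter(tag):
        # Join all nested text fragments so child elements are not lost.
        text = "".join(el.itertext())
        # Naive tokenizer: runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:
            token_lists.append(tokens)
    return token_lists


demo = ET.fromstring(SAMPLE)
counts = tag_counts(demo)
sentences = lines_as_tokens(demo)

print(counts)
print(sentences)
```

With the real corpus, lines_as_tokens(shakespeare.xml('dream.xml')) should give the same kind of result, provided the files really use LINE elements (tag_counts will tell you). A list of token lists is the shape that word2vec implementations such as gensim's Word2Vec accept as training sentences. Note that each item is a verse line rather than a grammatical sentence, so you may want to join the lines of a speech and run a sentence tokenizer over them instead.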

David Michael Gang answered Oct 21 '22 03:10
