Python NLTK Shakespeare corpus

Question

I am trying to import sentences from the Shakespeare corpus in NLTK (following this help site) in order to train a word2vec model, but I am having trouble getting access to the sentences:

from nltk.corpus import shakespeare  # XMLCorpusReader
shakespeare.fileids()
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', ...]

play = shakespeare.xml('dream.xml')  # ElementTree Element (the root <PLAY>)
print(play)
<Element 'PLAY' at ...>

for i in range(9):
    print('%s: %s' % (play[i].tag, play[i].text))

This returns the following:

TITLE: A Midsummer Night's Dream
PERSONAE: 

SCNDESCR: SCENE  Athens, and a wood near it.
PLAYSUBT: A MIDSUMMER NIGHT'S DREAM
ACT: None
ACT: None
ACT: None
ACT: None
ACT: None

Why are all the acts None?

None of the data-access methods described here (http://www.nltk.org/howto/corpus.html#data-access-methods), such as .sents(), .tagged_sents(), .chunked_sents() and .parsed_sents(), seem to work when applied to the shakespeare XMLCorpusReader.

I'd like to understand:
1/ how to get the sentences
2/ how to know how to look for them in an ElementTree object

David Michael Gang · Accepted Answer

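The question boils down to how to extract the text from all descendants of an element. Each ACT prints None because .text holds only the text that sits directly inside an element before its first child, and an ACT contains nothing but child elements. This is close to a duplicate of "Python element tree - extract text from element, stripping tags".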

Try this:

for p in play:
    print('%s: %s' % (p.tag, list(p.itertext())))

Replace the print call with whatever processing you need.
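
To make that concrete, here is a minimal sketch of both parts of the question: surveying which tags a play contains, and then collecting the text of each verse line as a list of tokens. It runs on a tiny inline stand-in rather than the real corpus, so the tag names (ACT, SCENE, SPEECH, LINE) are an assumption to check against your own tree, and tag_counts and tokenized_lines are hypothetical helper names, not part of NLTK. The regex tokenizer is only a placeholder for whatever tokenizer you prefer.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for a play. The tag names are an assumption about the corpus
# markup; run tag_counts() on the real tree to see what it actually contains.
SAMPLE = (
    "<PLAY><TITLE>Demo</TITLE>"
    "<ACT><TITLE>ACT I</TITLE>"
    "<SCENE><TITLE>SCENE I</TITLE>"
    "<SPEECH><SPEAKER>THESEUS</SPEAKER>"
    "<LINE>Now, fair Hippolyta, our nuptial hour</LINE>"
    "<LINE>Draws on apace; four happy days bring in</LINE>"
    "</SPEECH></SCENE></ACT></PLAY>"
)


def tag_counts(play):
    """Count how often each tag occurs anywhere below (and including) play."""
    return Counter(elem.tag for elem in play.iter())


def tokenized_lines(play, tag="LINE"):
    """Return one list of lowercase word tokens per element named `tag`."""
    result = []
    for elem in play.iter(tag):
        # itertext() walks the element and all its descendants, so text
        # nested inside child tags is included as well.
        text = "".join(elem.itertext())
        tokens = re.findall(r"[a-z']+", text.lower())
        if tokens:
            result.append(tokens)
    return result


play = ET.fromstring(SAMPLE)
act = play.find("ACT")
print(act.text)            # None: the ACT holds only child elements
print(tag_counts(play))    # which tags exist, and how many of each
sentences = tokenized_lines(play)
print(sentences)
```

With the real corpus you would pass shakespeare.xml('dream.xml') instead of the parsed sample. The result is a list of token lists, which is the shape gensim's Word2Vec expects for its sentences argument. Note that a verse line is not a grammatical sentence; if you need real sentences, one option is to join the lines of each SPEECH first and run a sentence tokenizer over the joined text.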
