I am trying to import sentences from NLTK's Shakespeare corpus, following this help site, but I am having trouble getting access to the sentences (which I need in order to train a word2vec model):
from nltk.corpus import shakespeare #XMLCorpusreader
shakespeare.fileids()
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', ...]
play = shakespeare.xml('dream.xml') #ElementTree object
print(play)
<Element 'PLAY' at ...>
for i in range(9):
    print('%s: %s' % (play[i].tag, play[i].text))
This returns the following:
TITLE: A Midsummer Night's Dream
PERSONAE:
SCNDESCR: SCENE Athens, and a wood near it.
PLAYSUBT: A MIDSUMMER NIGHT'S DREAM
ACT: None
ACT: None
ACT: None
ACT: None
ACT: None
Why are all the acts None?
None of the methods defined here (http://www.nltk.org/howto/corpus.html#data-access-methods), namely .sents(), .tagged_sents(), .chunked_sents() and .parsed_sents(), seem to work when applied to the shakespeare XMLCorpusReader.
I'd like to understand:
1/ how to get the sentences
2/ how to find them in an ElementTree object
The question boils down to how to extract text from all children of an element tree. This is close to a duplicate of "Python element tree - extract text from element, stripping tags".
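As for why the acts print None: in ElementTree, an element's .text attribute holds only the text between its opening tag and its first child. An element that contains nothing but child elements has no such text, whereas itertext() walks the whole subtree. A minimal sketch of the difference, using a hypothetical miniature document (SAMPLE is made up for illustration and is not the real corpus file):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a play, written without whitespace between tags.
SAMPLE = (
    "<PLAY><TITLE>A Midsummer Night's Dream</TITLE>"
    "<ACT><TITLE>ACT I</TITLE>"
    "<SCENE><SPEECH><SPEAKER>THESEUS</SPEAKER>"
    "<LINE>Now, fair Hippolyta, our nuptial hour</LINE>"
    "</SPEECH></SCENE></ACT></PLAY>"
)

play = ET.fromstring(SAMPLE)
title, act = play[0], play[1]

# TITLE contains text directly, so .text returns it.
title_text = title.text

# ACT contains only child elements, so there is no text before the
# first child and .text is None.
act_text = act.text

# itertext() yields every text fragment in the subtree, in document order.
act_pieces = list(act.itertext())

print(title_text)
print(act_text)
print(act_pieces)
```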
Try this:
for p in play:
    print('%s: %s' % (p.tag, list(p.itertext())))
Replace the print call with whatever logic you need.
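For the word2vec use case, one possible way to fill in that logic is sketched below. It assumes the spoken text sits in LINE elements inside SPEECH elements (check this against your own file), and it uses a deliberately naive split on ., ! and ? rather than a proper sentence tokenizer. The helper name speech_sentences and the SAMPLE string are hypothetical, introduced only for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature speech; the real corpus file is much larger.
SAMPLE = (
    "<PLAY><ACT><SCENE>"
    "<SPEECH><SPEAKER>THESEUS</SPEAKER>"
    "<LINE>Now, fair Hippolyta, our nuptial hour</LINE>"
    "<LINE>Draws on apace. Four happy days bring in</LINE>"
    "<LINE>Another moon.</LINE>"
    "</SPEECH></SCENE></ACT></PLAY>"
)


def speech_sentences(root):
    """Return a list of token lists, one per (naively split) sentence."""
    sentences = []
    for speech in root.iter("SPEECH"):
        # Join the verse lines of one speech so that sentences running
        # across line breaks stay together. itertext() also picks up text
        # nested inside child elements of a LINE.
        lines = ["".join(line.itertext()) for line in speech.iter("LINE")]
        text = " ".join(lines)
        # Naive sentence split on terminal punctuation (an assumption,
        # not a real sentence tokenizer).
        for chunk in re.split(r"[.!?]+", text):
            tokens = re.findall(r"[a-z']+", chunk.lower())
            if tokens:
                sentences.append(tokens)
    return sentences


sentences = speech_sentences(ET.fromstring(SAMPLE))
for s in sentences:
    print(s)
```

With the real corpus you would pass the element returned by shakespeare.xml('dream.xml') instead of the parsed sample. The result is a list of token lists, which is the input shape word2vec implementations such as gensim's generally expect.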