This is a follow-up to my earlier question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I am getting an error from the nltk.sem.extract_rels call:
AttributeError: 'Tree' object has no attribute 'text'
Here is the complete code:
import nltk
import re
#billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences)
# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]
# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)
This example is very similar to the one given in the book, but the book's example uses prepared "parsed docs," which appear out of nowhere, and I don't know where to find their object type. I scoured through the git libraries as well. Any help is appreciated.
My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.
It looks like, to count as a "parsed doc," an object needs a headline member and a text member, both of which are lists of tokens in which some of the tokens are marked up as trees. For example, this (hacky) example works:
import nltk
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
class doc():
    pass
doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']
for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    print nltk.sem.relextract.show_raw_rtuple(rel)
When run, this produces the output:
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
Obviously you wouldn't actually code it like this, but it provides a working example of the data format expected by extract_rels; you just need to work out the preprocessing steps that get your data massaged into that format.
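As a rough sketch of that preprocessing step, something like the following could wrap any sequence of tokens in an object with the two members described above. Note that make_doc is a hypothetical helper name of my own, not part of NLTK, and defaulting headline to an empty list is an assumption on my part (the hack above used a dummy ['foo'] instead). The sketch uses only the standard library so it runs without NLTK installed; plain strings stand in for the tokens.

```python
from types import SimpleNamespace


def make_doc(text_tokens, headline_tokens=None):
    """Wrap token sequences in an object exposing .text and .headline.

    make_doc is a hypothetical helper, not an NLTK function. It only
    builds the container; it does no chunking or tagging itself.
    """
    if headline_tokens is None:
        # Assumption: an empty headline is acceptable when there is none.
        headline_tokens = []
    return SimpleNamespace(text=list(text_tokens),
                           headline=list(headline_tokens))


# Plain strings stand in for tokens here. In real use some items would be
# nltk.Tree chunks, e.g. nltk.Tree('PERSON', ['Gross']).
doc = make_doc(['WHYY', 'in', 'Philadelphia', '.'], headline_tokens=['foo'])
print(doc.text)
print(doc.headline)
```

The resulting object would then be what gets passed as the doc argument to extract_rels with corpus='ieer', in place of the hand-built class above. Whether the output of your own chunker needs further massaging before it goes into text is something you would have to check against your data.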