
How to extract relationship from text in NLTK

Tags:

nlp

nltk

Hi, I'm trying to extract relationships from a string of text, based on the second-to-last example here: https://web.archive.org/web/20120907184244/http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html

From a string such as "Michael James editor of Publishers Weekly", I would like output such as:

[PER: 'Michael James'] ', editor of' [ORG: 'Publishers Weekly']

What is the best way to do this? What format does extract_rels expect, and how do I format my input to meet that requirement?


I tried to do it myself, but it didn't work. Here is the code I've adapted from the book; I'm not getting any results printed. What am I doing wrong?

import nltk
import re
from nltk.sem import relextract


class doc():
    pass

doc.headline = ['this is expected by nltk.sem.extract_rels but not used in this script']

def findrelations(text):
    roles = """
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)
    tokenizedsentences = nltk.sent_tokenize(text)
    for sentence in tokenizedsentences:
        taggedwords = nltk.pos_tag(nltk.word_tokenize(sentence))
        doc.text = nltk.batch_ne_chunk(taggedwords)
        print doc.text
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=ROLES):
            print relextract.show_raw_rtuple(rel)  # doctest: +ELLIPSIS

text = "Michael James editor of Publishers Weekly"

findrelations(text)

Asked Sep 04 '12 by CraigH



1 Answer

Here is code based on yours (just a few adjustments) that works well ;) The main changes are chunking with nltk.ne_chunk_sents and passing each chunked sentence tree directly to extract_rels with corpus='ace', rather than building an ieer-style document object.

import nltk
import re 
from nltk.chunk import ne_chunk_sents
from nltk.sem import relextract


def findrelations(text):
    # role words that may link a PERSON to an ORGANIZATION
    roles = """
    (.*(
    analyst|
    editor|
    librarian).*)|
    researcher|
    spokes(wo)?man|
    writer|
    ,\sof\sthe?\s*  # "X, of (the) Y"
    """
    ROLES = re.compile(roles, re.VERBOSE)

    # sentence-split, tokenize and POS-tag, then run the named-entity chunker
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences)

    for doc in chunked_sentences:
        print doc
        for rel in relextract.extract_rels('PER', 'ORG', doc, corpus='ace', pattern=ROLES):
            # it is a tree, so you need to work on it to output what you want
            print relextract.show_raw_rtuple(rel)

findrelations('Michael James editor of Publishers Weekly')

(S (PERSON Michael/NNP) (PERSON James/NNP) editor/NN of/IN (ORGANIZATION Publishers/NNS Weekly/NNP))
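
If you want something closer to the format asked for in the question ([PER: 'Michael James'] ', editor of' [ORG: 'Publishers Weekly']), one option is to walk the chunked tree yourself. Here is a minimal sketch, assuming the NLTK 3 Tree API (subtree.label(); on older NLTK versions the attribute is subtree.node) and using a made-up helper name, format_relation:

from nltk.tree import Tree

def format_relation(chunked_sentence):
    # Named-entity subtrees become "[LABEL: 'words']"; plain (word, tag)
    # tokens are kept as the connecting text between entities.
    parts = []
    for node in chunked_sentence:
        if isinstance(node, Tree):
            entity = " ".join(word for word, tag in node.leaves())
            parts.append("[%s: '%s']" % (node.label(), entity))
        else:
            word, tag = node
            parts.append(word)
    return " ".join(parts)

For the tree printed above this gives something like [PERSON: 'Michael'] [PERSON: 'James'] editor of [ORGANIZATION: 'Publishers Weekly'], which you could then post-process (for example, merging adjacent PERSON chunks) to match the exact output you want.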

Answered Oct 10 '22 by Vinicius Woloszyn