
NLP - information extraction in Python (spaCy)

I am trying to extract the following tabular information (one row per paragraph) from paragraphs like the ones in text below:

 women_ran men_ran kids_ran walked
         1       2        1      3
         2       4        3      1
         3       6        5      2

text = [
    "On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.",
    "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.",
    "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park.",
]

I am using Python's spaCy as my NLP library. I am new to NLP work and am hoping for some guidance on the best way to extract this tabular information from such sentences.

If it were simply a matter of identifying whether individuals were running or walking, I would just use sklearn to fit a classification model, but the information I need to extract is more granular than that (I am trying to retrieve subcategories and a value for each). Any guidance would be greatly appreciated.

kathystehl asked Nov 06 '16 19:11


1 Answer

You'll want to use the dependency parse for this. You can see a visualisation of your example sentence using the displaCy visualiser.

You could implement the rules you need in a few different ways, much as there are always multiple ways to write an XPath query, a DOM selector, etc.
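To make the "several ways to write the same rule" point concrete, here is a minimal sketch that does not need spaCy at all. It uses a hypothetical Tok class as a toy stand-in for a parsed token, and the parse of "2 men ran" is annotated by hand (the labels are an assumption about what a parser would produce, not real parser output). The same number/subject/verb relation is queried from two directions:

```python
class Tok:
    """Toy stand-in for a parsed token (hypothetical, not spaCy's Token)."""

    def __init__(self, text, dep):
        self.text = text
        self.dep = dep        # dependency label, e.g. 'nsubj' or 'nummod'
        self.head = None      # syntactic parent; stays None for the root
        self.children = []    # tokens that depend on this one


def attach(child, head):
    """Link child under head and return the child."""
    child.head = head
    head.children.append(child)
    return child


# Hand-annotated parse of "2 men ran" (assumed labels).
ran = Tok("ran", "ROOT")
men = attach(Tok("men", "nsubj"), ran)
two = attach(Tok("2", "nummod"), men)
tokens = [two, men, ran]


def from_subjects(tokens):
    """Rule A: start at each subject, look down for a number and up for the verb."""
    found = []
    for tok in tokens:
        if tok.dep == "nsubj":
            for child in tok.children:
                if child.dep == "nummod":
                    found.append((child.text, tok.text, tok.head.text))
    return found


def from_numbers(tokens):
    """Rule B: start at each number and walk up twice: number -> noun -> verb."""
    found = []
    for tok in tokens:
        if tok.dep == "nummod" and tok.head is not None and tok.head.dep == "nsubj":
            found.append((tok.text, tok.head.text, tok.head.head.text))
    return found


print(from_subjects(tokens))  # [('2', 'men', 'ran')]
print(from_numbers(tokens))   # [('2', 'men', 'ran')]
```

Both rules return the same triple here; which direction is more convenient depends on which token is easiest to find first in your data.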

Something like this should work:

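import spacy

# 'en_core_web_sm' is the small English model; spaCy 1.x/2.x also accepted the shortcut 'en'.
nlp = spacy.load('en_core_web_sm')
docs = [nlp(t) for t in text]  # text is the list of paragraphs from the question
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Nominal subjects: the nouns doing the running or walking.
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers to the left of the subject, e.g. "2" in "2 men".
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))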

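For the examples in text, you should get something like the following (the exact matches depend on the parse the model produces). Note that these simple rules do not catch every count, for instance the women in the first sentence, so you will need to extend them: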

document.sentence: 0.0, subject: men, action: ran, numbers: 2
document.sentence: 0.0, subject: child, action: ran, numbers: 1
document.sentence: 0.1, subject: people, action: walking, numbers: 3
document.sentence: 1.0, subject: person, action: walking, numbers: One
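
To get from printed matches to the table in the question, the matches still need to be normalised and grouped per document. The following is a minimal sketch of one way to do that, assuming the matches are collected as (document index, subject, action, number) tuples instead of being printed. The lookup tables and the helper names (to_int, column_for, build_rows) are hypothetical and only cover the words seen in this example:

```python
from collections import defaultdict

# Assumed lookup tables; extend them for your own data.
WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
SUBJECT_GROUP = {"woman": "women", "women": "women",
                 "man": "men", "men": "men",
                 "child": "kids", "children": "kids", "kid": "kids", "kids": "kids"}
ACTION_GROUP = {"run": "ran", "ran": "ran", "running": "ran",
                "walk": "walked", "walked": "walked", "walking": "walked"}


def to_int(token_text):
    """Convert '2' or 'One' to an int; return None if the word is unknown."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return WORD_TO_INT.get(lowered)


def column_for(subject, action):
    """Map a (subject, action) pair to a table column, or None if unmapped."""
    act = ACTION_GROUP.get(action.lower())
    if act is None:
        return None
    if act == "walked":
        return "walked"  # walkers are not split by subject in the target table
    group = SUBJECT_GROUP.get(subject.lower())
    return "{}_ran".format(group) if group else None


def build_rows(records, n_docs):
    """Sum the counts per document and column; one dict per document."""
    rows = [defaultdict(int) for _ in range(n_docs)]
    for doc_idx, subject, action, number in records:
        column = column_for(subject, action)
        value = to_int(number)
        if column is not None and value is not None:
            rows[doc_idx][column] += value
    return [dict(row) for row in rows]


# The four matches listed above, as tuples.
records = [
    (0, "men", "ran", "2"),
    (0, "child", "ran", "1"),
    (0, "people", "walking", "3"),
    (1, "person", "walking", "One"),
]
rows = build_rows(records, n_docs=3)
for row in rows:
    print(row)
# {'men_ran': 2, 'kids_ran': 1, 'walked': 3}
# {'walked': 1}
# {}
```

Columns the extraction rules never matched are simply absent from a row, which makes it easy to see where the dependency rules still need work. The resulting list of dicts can be passed to pandas.DataFrame if you want an actual table.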

syllogism_ answered Oct 15 '22 17:10