
Sentence Structure identification - spacy

I intend to identify the sentence structure in English using spacy and textacy.

For example: "The cat sat on the mat." is SVO, "The cat jumped and picked up the biscuit." is SVVO, and "The cat ate the biscuit and cookies." is SVOO.

The program is supposed to read a paragraph and return the output for each sentence as SVO, SVOO, SVVO or other custom structures.

Efforts so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"] 
VERB = ["ROOT"] 
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)

Output:

(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
  • Issue 1: The SVO are overwritten. Why?
  • Issue 2: How to identify the sentence as SVOO SVO SVVO etc.?

Edit 1:

An approach I was considering:

from __future__ import unicode_literals
import en_core_web_sm

nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
# Fine-grained tags expected in the sentence (pronoun, modal, noun)
chk_set = {'PRP', 'MD', 'NN'}
# True only if every tag in chk_set occurs somewhere in the sentence
result = chk_set.issubset(t.tag_ for t in doc)
if result:
    print("SVO")
else:
    print("SVO not identified")

Edit 2:

Made further progress:

from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))

Current output:

det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct

Expected output:

SVO SVVO SVOO

The idea is to reduce the dependency tags to a simple subject-verb-object model.

I am thinking of doing this with regular expressions if no other option is available, but that is my last resort.
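
A minimal sketch of that reduction, working directly on the flat tag string printed above. The function name `patterns_from_deps` and the `ROLE` table are made up for illustration. It rests on two assumptions that will not hold in general: every `punct` tag ends a sentence (a comma would break this), and a `conj` repeats the role of the nearest preceding subject, verb, or object.

```python
# Hypothetical helper: maps a flat sequence of spaCy dependency labels
# to a coarse S/V/O pattern, one pattern per sentence.
ROLE = {
    "nsubj": "S", "nsubjpass": "S",
    "ROOT": "V",
    "dobj": "O", "pobj": "O",
}

def patterns_from_deps(dep_string):
    """Return one S/V/O pattern per sentence in a space-separated dep string."""
    patterns, current, last_role = [], [], None
    for dep in dep_string.split():
        if dep == "punct":
            # Assumption: punctuation only occurs at the end of a sentence.
            if current:
                patterns.append("".join(current))
            current, last_role = [], None
        elif dep in ROLE:
            last_role = ROLE[dep]
            current.append(last_role)
        elif dep == "conj" and last_role is not None:
            # Heuristic: a conjunct repeats the nearest preceding S/V/O role.
            current.append(last_role)
    if current:
        patterns.append("".join(current))
    return patterns

deps = ("det nsubj ROOT prep det pobj punct "
        "det nsubj ROOT cc conj prt det dobj punct "
        "det nsubj ROOT dobj cc conj punct")
print(" ".join(patterns_from_deps(deps)))  # SVO SVVO SVOO
```

Because the flat string drops the head of each token, the `conj` rule is only a positional guess; resolving it properly needs the dependency tree.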

Edit 3:

After studying this link, I made some progress.

def testSVOs():
    # findSVOs is presumably the helper from the linked code; en_core_web_sm is imported as above
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)

Current output:

[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]

Expected output:

I am expecting a structure notation for each sentence. I am able to extract the SVO triples, but I don't know how to convert them into SVO notation. It is more a matter of pattern identification than of the sentence content itself.

SVO SVO SVOO
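
One possible way to get from the triples to the notation, sketched under the assumption that consecutive triples sharing the same subject and verb come from the same clause. `notation_from_triples` is a hypothetical name, not part of spaCy or textacy.

```python
from itertools import groupby

def notation_from_triples(triples):
    """Collapse consecutive triples sharing subject and verb into one pattern."""
    patterns = []
    for _, group in groupby(triples, key=lambda t: (t[0], t[1])):
        # One "O" per object attached to this subject-verb pair.
        n_objects = len(list(group))
        patterns.append("SV" + "O" * n_objects)
    return patterns

svos = [("cat", "sat", "mat"), ("cat", "jumped", "biscuit"),
        ("cat", "ate", "biscuit"), ("cat", "ate", "cookies")]
print(" ".join(notation_from_triples(svos)))  # SVO SVO SVOO
```

Since each triple carries a single verb, this cannot produce SVVO, and two adjacent sentences with the same subject and verb would be merged into one pattern.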
Programmer_nltk asked Mar 24 '18 00:03



1 Answer

Issue 1: The SVO are overwritten. Why?

This is a textacy issue. That part of the library does not work very well; see this blog.

Issue 2: How to identify the sentence as SVOO SVO SVVO etc.?

You should parse the dependency tree. spaCy provides the information; you just need to write a set of rules to extract it, using the .head, .lefts, .rights and .children attributes of each token.

>>> for word in text:
...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))

        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN sat 
        sat   VBD       ROOT       VERB sat 
         on    IN       prep        ADP sat 
        the    DT        det        DET mat
        mat    NN       pobj       NOUN on 
          .     .      punct      PUNCT sat 
         of    IN       ROOT        ADP of 
        the    DT        det        DET lab
        art    NN   compound       NOUN lab
        lab    NN       pobj       NOUN of 
          .     .      punct      PUNCT of 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN jumped 
     jumped   VBD       ROOT       VERB jumped 
        and    CC         cc      CCONJ jumped 
     picked   VBD       conj       VERB jumped 
         up    RP        prt       PART picked 
        the    DT        det        DET biscuit
    biscuit    NN       dobj       NOUN picked 
          .     .      punct      PUNCT jumped 
        The    DT        det        DET cat 
        cat    NN      nsubj       NOUN ate 
        ate   VBD       ROOT       VERB ate 
    biscuit    NN       dobj       NOUN ate 
        and    CC         cc      CCONJ biscuit 
    cookies   NNS       conj       NOUN biscuit 
          .     .      punct      PUNCT ate 

I recommend you look at this code: just add pobj to the list of OBJECTS and you will have SVO and SVOO covered. With a little fiddling you can get SVVO as well.
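
As a rough illustration of such rules (not the linked code), here is a sketch that labels each token from its dependency tag and lets a `conj` inherit the role of its head. To stay runnable without a model, it works on `(text, dep, head_index)` tuples transcribed from the three example sentences in the table above; `role`, `pattern` and the tuple layout are assumptions made for this example.

```python
SUBJ = {"nsubj", "nsubjpass"}
OBJ = {"dobj", "pobj"}

def role(tokens, i):
    """Return 'S', 'V', 'O' or None for token i of one sentence."""
    _, dep, head = tokens[i]
    if dep in SUBJ:
        return "S"
    if dep == "ROOT":
        return "V"
    if dep in OBJ:
        return "O"
    if dep == "conj" and head != i:
        # A conjunct takes the role of the token it is attached to.
        return role(tokens, head)
    return None

def pattern(tokens):
    """Join the roles of all labelled tokens in sentence order."""
    roles = (role(tokens, i) for i in range(len(tokens)))
    return "".join(r for r in roles if r)

# Each tuple is (text, dep, index of head within the sentence).
# With spaCy this could be built per sentence as, for example:
#   [(t.text, t.dep_, t.head.i - sent.start) for t in sent]
sentences = [
    [("The", "det", 1), ("cat", "nsubj", 2), ("sat", "ROOT", 2),
     ("on", "prep", 2), ("the", "det", 5), ("mat", "pobj", 3),
     (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("jumped", "ROOT", 2),
     ("and", "cc", 2), ("picked", "conj", 2), ("up", "prt", 4),
     ("the", "det", 7), ("biscuit", "dobj", 4), (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("ate", "ROOT", 2),
     ("biscuit", "dobj", 2), ("and", "cc", 3), ("cookies", "conj", 3),
     (".", "punct", 2)],
]
print(" ".join(pattern(s) for s in sentences))  # SVO SVVO SVOO
```

Following the head is what separates "picked" (attached to the verb "jumped", so V) from "cookies" (attached to the object "biscuit", so O). Real text will need more rules, for instance for clausal subjects or indirect objects.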

igrinis answered Oct 17 '22 02:10
