How to extract subjects in a sentence and their respective dependent phrases?

Tags: python, nlp, nltk, spacy

I am trying to extract the subjects of a sentence so that I can compute the sentiment associated with each subject. I am using nltk in Python 2.7 for this purpose. Take the following sentence as an example:

Donald Trump is the worst president of USA, but Hillary is better than him

Here we can see that Donald Trump and Hillary are the two subjects; the sentiment related to Donald Trump is negative, while the sentiment related to Hillary is positive. So far, I am able to break this sentence into noun-phrase chunks and get the following:

(S
  (NP Donald/NNP Trump/NNP)
  is/VBZ
  (NP the/DT worst/JJS president/NN)
  in/IN
  (NP USA,/NNP)
  but/CC
  (NP Hillary/NNP)
  is/VBZ
  better/JJR
  than/IN
  (NP him/PRP))

Now, how do I find the subjects among these noun phrases? And how do I then group the phrases that belong to each subject? Once I have the phrases for each subject separately, I can run sentiment analysis on each group.

EDIT

I looked into the library mentioned by @Krzysiek (spacy), and it also gives me a dependency tree for the sentence.

Here is the code:

from spacy.en import English
parser = English()

example = u"Donald Trump is the worst president of USA, but Hillary is better than him"
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
    print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])

Here are the dependency trees:

(u'Donald', u'compound', u'Trump', [], [])
(u'Trump', u'nsubj', u'is', [u'Donald'], [])
(u'is', u'ROOT', u'is', [u'Trump'], [u'president', u',', u'but', u'is'])
(u'the', u'det', u'president', [], [])
(u'worst', u'amod', u'president', [], [])
(u'president', u'attr', u'is', [u'the', u'worst'], [u'of'])
(u'of', u'prep', u'president', [], [u'USA'])
(u'USA', u'pobj', u'of', [], [])
(u',', u'punct', u'is', [], [])
(u'but', u'cc', u'is', [], [])
(u'Hillary', u'nsubj', u'is', [], [])
(u'is', u'conj', u'is', [u'Hillary'], [u'better'])
(u'better', u'acomp', u'is', [], [u'than'])
(u'than', u'prep', u'better', [], [u'him'])
(u'him', u'pobj', u'than', [], [])

This gives in-depth insight into the dependencies between the tokens of the sentence. Here is the link to the paper which describes the dependencies between different pairs. How can I use this tree to attach the contextual words to their respective subjects?
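For reference, one possible starting point is sketched below in plain Python, with no spacy dependency. It copies the rows of the dependency output above into a list, using head indices derived from the left/right dependents shown there (the two "is" tokens make head words ambiguous). For every `nsubj` token it collects the words of the clause governed by that subject's head verb, without descending into coordinated clauses. The names `PARSE`, `subtree` and `clauses_by_subject` are made up for this illustration.

```python
# Rows from the dependency output above: (token, dependency label, head index).
# Head indices are derived from the left/right dependents in that output.
PARSE = [
    ("Donald", "compound", 1), ("Trump", "nsubj", 2), ("is", "ROOT", 2),
    ("the", "det", 5), ("worst", "amod", 5), ("president", "attr", 2),
    ("of", "prep", 5), ("USA", "pobj", 6), (",", "punct", 2),
    ("but", "cc", 2), ("Hillary", "nsubj", 11), ("is", "conj", 2),
    ("better", "acomp", 11), ("than", "prep", 12), ("him", "pobj", 13),
]


def children(i):
    """Indices of the tokens whose head is token i (the ROOT heads itself)."""
    return [j for j, (_, _, head) in enumerate(PARSE) if head == i and j != i]


def subtree(i, skip=()):
    """Token i plus all its descendants, not following edges labelled in `skip`."""
    out = [i]
    for child in children(i):
        if PARSE[child][1] not in skip:
            out.extend(subtree(child, skip))
    return sorted(out)


def clauses_by_subject():
    """Map each subject phrase to the remaining words of its own clause."""
    result = {}
    for i, (_, dep, head) in enumerate(PARSE):
        if dep != "nsubj":
            continue
        subj = subtree(i)
        # Stop at conj/cc/punct so the coordinated clause stays separate.
        clause = subtree(head, skip=("conj", "cc", "punct"))
        subject = " ".join(PARSE[j][0] for j in subj)
        predicate = " ".join(PARSE[j][0] for j in clause if j not in subj)
        result[subject] = predicate
    return result


print(clauses_by_subject())
```

With a real spacy parse the same idea would walk `token.head`, `token.lefts` and `token.rights` instead of the hand-built list; the list is used here only to keep the sketch self-contained.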

943

asked Sep 29 '16 06:09

psr

2 Answers

I explored the spacy library further and finally worked out a solution using dependency parsing. Thanks to this repo, I figured out how to include adjectives in my subject-verb-object triples (making them SVAOs), as well as how to extract compound subjects. Here is my solution:

from nltk.stem.wordnet import WordNetLemmatizer
from spacy.lang.en import English

SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]
ADJECTIVES = ["acomp", "advcl", "advmod", "amod", "appos", "nn", "nmod", "ccomp", "complm",
              "hmod", "infmod", "xcomp", "rcmod", "poss", "possessive"]
COMPOUNDS = ["compound"]
PREPOSITIONS = ["prep"]

def getSubsFromConjunctions(subs):
    moreSubs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(moreSubs) > 0:
                moreSubs.extend(getSubsFromConjunctions(moreSubs))
    return moreSubs

def getObjsFromConjunctions(objs):
    moreObjs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if "and" in rightDeps:
            moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(moreObjs) > 0:
                moreObjs.extend(getObjsFromConjunctions(moreObjs))
    return moreObjs

def getVerbsFromConjunctions(verbs):
    moreVerbs = []
    for verb in verbs:
        rightDeps = {tok.lower_ for tok in verb.rights}
        if "and" in rightDeps:
            moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
            if len(moreVerbs) > 0:
                moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
    return moreVerbs

def findSubs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        subs = [tok for tok in head.lefts if tok.dep_ in SUBJECTS]
        if len(subs) > 0:
            verbNegated = isNegated(head)
            subs.extend(getSubsFromConjunctions(subs))
            return subs, verbNegated
        elif head.head != head:
            return findSubs(head)
    elif head.pos_ == "NOUN":
        return [head], isNegated(tok)
    return [], False

def isNegated(tok):
    negations = {"no", "not", "n't", "never", "none"}
    for dep in list(tok.lefts) + list(tok.rights):
        if dep.lower_ in negations:
            return True
    return False

def findSVs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs

def getObjsFromPrepositions(deps):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and dep.dep_ == "prep":
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
    return objs

def getAdjectives(toks):
    toks_with_adjectives = []
    for tok in toks:
        adjs = [left for left in tok.lefts if left.dep_ in ADJECTIVES]
        adjs.append(tok)
        # Filter on each right child's own dependency label, not the parent's.
        adjs.extend([right for right in tok.rights if right.dep_ in ADJECTIVES])
        toks_with_adjectives.extend(adjs)

    return toks_with_adjectives

def getObjsFromAttrs(deps):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(getObjsFromPrepositions(rights))
                    if len(objs) > 0:
                        return v, objs
    return None, None

def getObjFromXComp(deps):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(getObjsFromPrepositions(rights))
            if len(objs) > 0:
                return v, objs
    return None, None

def getAllSubs(v):
    verbNegated = isNegated(v)
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(getSubsFromConjunctions(subs))
    else:
        foundSubs, verbNegated = findSubs(v)
        subs.extend(foundSubs)
    return subs, verbNegated

def getAllObjs(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
    objs.extend(getObjsFromPrepositions(rights))

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def getAllObjsWithAdjectives(v):
    # rights is a generator
    rights = list(v.rights)
    objs = [tok for tok in rights if tok.dep_ in OBJECTS]

    if len(objs)== 0:
        objs = [tok for tok in rights if tok.dep_ in ADJECTIVES]

    objs.extend(getObjsFromPrepositions(rights))

    potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
    if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
        objs.extend(potentialNewObjs)
        v = potentialNewVerb
    if len(objs) > 0:
        objs.extend(getObjsFromConjunctions(objs))
    return v, objs

def findSVOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjs(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
    return svos

def findSVAOs(tokens):
    svos = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
    for v in verbs:
        subs, verbNegated = getAllSubs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            v, objs = getAllObjsWithAdjectives(v)
            for sub in subs:
                for obj in objs:
                    objNegated = isNegated(obj)
                    obj_desc_tokens = generate_left_right_adjectives(obj)
                    sub_compound = generate_sub_compound(sub)
                    svos.append((" ".join(tok.lower_ for tok in sub_compound), "!" + v.lower_ if verbNegated or objNegated else v.lower_, " ".join(tok.lower_ for tok in obj_desc_tokens)))
    return svos

def generate_sub_compound(sub):
    sub_compunds = []
    for tok in sub.lefts:
        if tok.dep_ in COMPOUNDS:
            sub_compunds.extend(generate_sub_compound(tok))
    sub_compunds.append(sub)
    for tok in sub.rights:
        if tok.dep_ in COMPOUNDS:
            sub_compunds.extend(generate_sub_compound(tok))
    return sub_compunds

def generate_left_right_adjectives(obj):
    obj_desc_tokens = []
    for tok in obj.lefts:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))
    obj_desc_tokens.append(obj)

    for tok in obj.rights:
        if tok.dep_ in ADJECTIVES:
            obj_desc_tokens.extend(generate_left_right_adjectives(tok))

    return obj_desc_tokens

Now, when you pass in a sentence such as:

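import spacy

# A pipeline with a tagger and dependency parser is required; the bare
# English() class from spacy.lang.en only tokenizes. The model name below
# assumes the small English model is installed.
# Note: recent English models may tag the copula "is" as AUX, in which case
# the pos_ == "VERB" filters above need to accept "AUX" as well.
parser = spacy.load("en_core_web_sm")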

sentence = u"""
Donald Trump is the worst president of USA, but Hillary is better than him
"""

parse = parser(sentence)
print(findSVAOs(parse))

You will get the following:

[(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]
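Since the original goal was per-subject sentiment, here is a minimal sketch of how triples like the ones above could be scored. The lexicon `TOY_LEXICON` and the function `score_triples` are purely hypothetical placeholders for illustration; in practice the lexicon lookup would be replaced by a real sentiment analyzer (for example NLTK's VADER).

```python
# Hypothetical toy lexicon, for illustration only (not a real sentiment resource).
TOY_LEXICON = {"worst": -1.0, "bad": -0.5, "better": 0.5, "best": 1.0}


def score_triples(triples, lexicon=TOY_LEXICON):
    """Sum word scores of each description and attribute them to its subject."""
    scores = {}
    for subject, verb, description in triples:
        total = sum(lexicon.get(word, 0.0) for word in description.split())
        # findSVAOs marks negated verbs with a leading "!", so flip the sign.
        if verb.startswith("!"):
            total = -total
        scores[subject] = scores.get(subject, 0.0) + total
    return scores


triples = [("donald trump", "is", "worst president"), ("hillary", "is", "better")]
print(score_triples(triples))
```

Keeping the scores keyed by subject means several triples about the same subject simply accumulate.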

Thank you @Krzysiek for your solution too. I was unable to dig deep enough into your library to modify it, so I instead adapted the repo linked above to solve my problem.

160

answered Sep 22 '22 04:09

psr

I was recently solving a very similar problem: I needed to extract subject(s), action, and object(s). I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner

It is based on spacy (an alternative to nltk), and it also works on the sentence's dependency tree.

So, for instance, let's take this doc parsed by spacy as an example:

import spacy
nlp = spacy.load("en")
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
               "Pacific and was already at war with the Republic of China " \
               "in 1937, but the world war is generally said to have begun on " \
               "1 September 1939 with the invasion of Poland by Germany and " \
               "subsequent declarations of war on Germany by France and the United Kingdom. " \
               "From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
               "or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
               "Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
               "annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
               "The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
               "and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
               "the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
               "long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
               "of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
               "of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
               "the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
               "the Western Pacific.")

You can now create a simple pipes structure (more about pipes in the project's README):

from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   FindTokensPipe("VERB"),
                   AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("GPE"), 
                                                NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()]),
                            SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                                          AggregatePipe([NamedEntityFilterPipe("LOC"), 
                                                NamedEntityFilterPipe("PERSON")]),
                                          NamedEntityExtractorPipe()])])]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()

As a result you will get:

>>>[([Germany], [conquered], [Europe]),
 ([Japan], [attacked], [the, United, States])]

The token-finding pipes rely heavily on another library, grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc

EDITED

Actually, the example I presented in the README discards adjectives, but all you need to do is adjust the pipe structure passed to the engine according to your needs. For instance, for your sample sentence I can propose the following structure, which gives you a tuple of 3 elements (subj, verb, adj) per sentence:

import spacy
from textpipeliner import PipelineEngine, Context
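from textpipeliner.pipes import *

# Parse the sample sentence from the question so that `doc` refers to it.
nlp = spacy.load("en")
doc = nlp(u"Donald Trump is the worst president of USA, but Hillary is better than him")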

pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                       AggregatePipe([FindTokensPipe("VERB"),
                                      FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                                      FindTokensPipe("VERB/xcomp/VERB")]),
                       AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                                AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                               FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                      ]

engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()

It will give you the result:

[([Donald, Trump], [is], [the, worst])]

A small complication is that you have a compound sentence and the lib produces one tuple per sentence. I'll soon add the possibility (I need it too for my project) to pass a list of pipe structures to the engine so that it can produce more tuples per sentence. For now you can solve it by creating a second engine for compound sentences, whose structure differs only in using VERB/conj/VERB instead of VERB (these regexes always start from ROOT, so VERB/conj/VERB leads you to the second verb in the compound sentence):

pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                                 NamedEntityFilterPipe(),
                                 NamedEntityExtractorPipe()]),
                   AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                                  FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                                  FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
                   AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                            AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                           FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                  ]

engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])

And now, after you run both engines, you will get the expected result :)

engine.process()
engine2.process()
[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]

This is what you need, I think. Of course, I just quickly created a pipe structure for the given example sentence, and it won't work for every case. Still, I have seen a lot of sentence structures and it should already cover a fair percentage of them. You can then add more FindTokensPipe etc. for the cases which don't work yet, and I'm sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complex, so... :)
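To picture how path patterns such as VERB/conj/VERB/nsubj/NNP walk down from ROOT, here is a toy matcher in plain Python. It is not grammaregex's implementation: `match_path` and `TOKENS` are hypothetical names, only literal labels and `*` are supported (no `**` and no `[a,b]` sets), and the POS tags in the table are assumptions chosen to match the patterns above, where "is" counts as a VERB.

```python
# (text, coarse POS, fine tag, dependency label, head index).
# Dependencies follow the parse shown in the question; POS tags are assumed.
TOKENS = [
    ("Donald", "PROPN", "NNP", "compound", 1), ("Trump", "PROPN", "NNP", "nsubj", 2),
    ("is", "VERB", "VBZ", "ROOT", 2), ("the", "DET", "DT", "det", 5),
    ("worst", "ADJ", "JJS", "amod", 5), ("president", "NOUN", "NN", "attr", 2),
    ("of", "ADP", "IN", "prep", 5), ("USA", "PROPN", "NNP", "pobj", 6),
    (",", "PUNCT", ",", "punct", 2), ("but", "CCONJ", "CC", "cc", 2),
    ("Hillary", "PROPN", "NNP", "nsubj", 11), ("is", "VERB", "VBZ", "conj", 2),
    ("better", "ADJ", "JJR", "acomp", 11), ("than", "ADP", "IN", "prep", 12),
    ("him", "PRON", "PRP", "pobj", 13),
]


def node_matches(i, label):
    """A node label matches on the coarse POS, the fine tag, or the wildcard."""
    _, pos, tag, _, _ = TOKENS[i]
    return label == "*" or label in (pos, tag)


def match_path(pattern):
    """Follow node/dep/node/... segments downward, starting at the ROOT token."""
    parts = pattern.split("/")
    root = next(i for i, tok in enumerate(TOKENS) if tok[3] == "ROOT")
    current = [root] if node_matches(root, parts[0]) else []
    # After the first node label, segments come in (dep label, node label) pairs.
    for dep, label in zip(parts[1::2], parts[2::2]):
        current = [j for i in current
                   for j, tok in enumerate(TOKENS)
                   if tok[4] == i and j != i and tok[3] == dep
                   and node_matches(j, label)]
    return [TOKENS[i][0] for i in current]


print(match_path("VERB/nsubj/NNP"))
print(match_path("VERB/conj/VERB/nsubj/NNP"))
print(match_path("VERB/conj/VERB/acomp/ADJ"))
```

Because every pattern restarts at ROOT, reaching the second clause requires the explicit conj step, which is why the second pipe structure prefixes each pattern with VERB/conj/VERB.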

answered Sep 20 '22 04:09

Krzysiek
