Using spaCy to replace the "topic" of a sentence

Tags: python, spacy

As a bit of a thought experiment, I coded up a function in Python that uses spaCy to find the subject of a news article title and then replaces it with a noun of my choice. The problem is that it doesn't work very well, and I was hoping it could be improved. I don't understand spaCy that well, and I find the documentation a bit hard to follow.

First, the code:

import random
import re

# Wrapper signature is assumed; the post shows only the function body.
def replace_subject(thetitle, toreplace, nlp):
    doc = nlp(thetitle)
    for text in doc:
        # nsubj for nominal subject
        if text.dep_ == "nsubj":
            subject = text.orth_
        # iobj for indirect object
        if text.dep_ == "iobj":
            indirect_object = text.orth_
        # dobj for direct object
        if text.dep_ == "dobj":
            direct_object = text.orth_
    try:
        subject
    except NameError:
        if not thetitle:  # if empty title
            thetitle = "cat"
            subject = "cat"
        else:  # if unknown subject
            try:  # do we have a direct object?
                direct_object
            except NameError:
                try:  # do we have an indirect object?
                    indirect_object
                except NameError:  # still no??
                    subject = random.choice(thetitle.split())
                else:
                    subject = indirect_object
            else:
                subject = direct_object
    else:
        thecat = "cat"  # do nothing here, everything went okay
    # re.escape stops tokens such as "(" from being read as regex syntax
    newtitle = re.sub(r"\b%s\b" % re.escape(subject), toreplace, thetitle)
    if newtitle == thetitle:  # if no replacement happened due to regex
        newtitle = thetitle.replace(subject, toreplace)
    return newtitle

The "cat" lines are filler that doesn't do anything. "thetitle" holds a random news article title that I'm pulling in from RSS feeds, and "toreplace" holds the string that replaces whatever subject is found.

Let's use an example:

"Video Games that Should Be Animated TV Shows - Screen Rant" And here's the displaCy breakdown of that: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1

The word the code replaced was "that", which isn't even a noun in this sentence. It seems to have come from the random-word fallback, since the parse contained no subject, indirect object, or direct object. I was hoping it would find something more like "Video Games" in this example.

I should note that if I take the last bit out (which appears to be the source of the news article) in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems to think "that" is the subject, which is incorrect.
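Taking the last bit out could also be done in code before the title is parsed. This is only a minimal sketch: it assumes the feed always appends the source after a final " - ", and the helper name strip_source_suffix is made up for illustration. A title that contains " - " for some other reason would lose its last segment.

```python
def strip_source_suffix(title, separator=" - "):
    """Drop a trailing ' - Publisher' segment, if one is present."""
    head, sep, _tail = title.rpartition(separator)
    # rpartition returns ("", "", title) when the separator is absent,
    # so an empty sep means there was nothing to strip.
    return head if sep else title


print(strip_source_suffix("Video Games that Should Be Animated TV Shows - Screen Rant"))
# prints: Video Games that Should Be Animated TV Shows
```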

What is a better way to parse this? Should I look for proper nouns first?

SpaceMouse asked Jun 05 '17 18:06


1 Answer

This doesn't directly answer your question, but I think the code below is far more readable: the conditions are explicit, and what happens when a condition holds is not buried in an else clause far away. It also handles cases with multiple objects.

To your problem: any natural language processing tool will have a hard time finding the subject (or perhaps rather the topic) of a sentence fragment, because these tools are trained on complete sentences. I'm not even sure such fragments technically have subjects (I'm not an expert, though). You could try to train your own model, but then you would have to provide labeled sentences, and I don't know whether such a dataset already exists for sentence fragments.

I am not fully sure what you want to achieve, but the word you want to replace is likely among the common nouns and pronouns, and the first one to appear is probably the most important.

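import spacy
import random
import re
from collections import defaultdict

def replace_subj(sentence, nlp):
    # An empty title has nothing to parse.
    if not sentence:
        return "cat"

    doc = nlp(sentence)

    # Group token texts by dependency label, keeping every match in order.
    tokens = defaultdict(list)
    for text in doc:
        tokens[text.dep_].append(text.orth_)

    # Prefer the subject, then the direct object, then the indirect object.
    if "nsubj" in tokens:
        subject = tokens["nsubj"][0]
    elif "dobj" in tokens:
        subject = tokens["dobj"][0]
    elif "iobj" in tokens:
        subject = tokens["iobj"][0]
    else:
        # No usable dependency found: fall back to a random word.
        subject = random.choice(sentence.split())

    # re.escape keeps tokens such as "(" or "+" from being read as regex syntax.
    return re.sub(r"\b{}\b".format(re.escape(subject)), "cat", sentence)

if __name__ == "__main__":
    sentence = """Video Games that Should Be Animated TV Shows - Screen Rant"""

    # "en" is a shortcut link from spaCy 1.x/2.x; spaCy 3 requires a full
    # package name such as "en_core_web_sm".
    nlp = spacy.load("en")
    print(replace_subj(sentence, nlp))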
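
The idea of taking the first noun to appear can be tried without relying on the dependency parse at all. What follows is a rough sketch, not a tested solution: the helper names, the Tok stand-in, and the choice of the NOUN and PROPN tags are all assumptions made for illustration, and the tags in the example are written by hand rather than produced by a model.

```python
import re
from collections import namedtuple

# Stand-in for a spaCy token so the sketch runs without a model;
# real spaCy tokens expose the same .text and .pos_ attributes.
Tok = namedtuple("Tok", ["text", "pos_"])

# Assumed choice of tags: common and proper nouns. Add "PRON" to include pronouns.
NOUN_TAGS = {"NOUN", "PROPN"}


def first_noun_run(tokens):
    """Return the texts of the first run of consecutive noun-like tokens."""
    run = []
    for tok in tokens:
        if tok.pos_ in NOUN_TAGS:
            run.append(tok.text)
        elif run:
            break  # the first run has ended
    return run


def replace_first_noun_run(title, tokens, replacement):
    """Replace the first noun run in title, or return title unchanged."""
    words = first_noun_run(tokens)
    if not words:
        return title
    # Allow any whitespace between the words; the lookarounds act like \b
    # but also work when a token starts or ends with punctuation.
    pattern = r"(?<!\w)" + r"\s+".join(re.escape(w) for w in words) + r"(?!\w)"
    # A function replacement avoids backslash handling in the replacement text.
    return re.sub(pattern, lambda match: replacement, title, count=1)


# Hand-written tags for illustration; a real tagger may label these differently.
title = "Video Games that Should Be Animated TV Shows"
tagged = [Tok("Video", "PROPN"), Tok("Games", "PROPN"), Tok("that", "PRON"),
          Tok("Should", "AUX"), Tok("Be", "AUX"), Tok("Animated", "ADJ"),
          Tok("TV", "NOUN"), Tok("Shows", "NOUN")]
print(replace_first_noun_run(title, tagged, "cats"))
# prints: cats that Should Be Animated TV Shows
```

With spaCy, the doc returned by nlp(title) can be passed as tokens directly, since iterating over it yields tokens with .text and .pos_. Whether the tagger actually labels "Video Games" as nouns in a headline like this one is something you would have to check.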
mwil.me answered Oct 14 '22 13:10