Any efficient way to find surrounding adjectives with respect to a target phrase in Python?

I am doing sentiment analysis on given documents. My goal is to find the closest or surrounding adjectives with respect to a target phrase in my sentences. I have an idea of how to extract surrounding words with respect to target phrases, but how do I find the relatively close or closest adjective (or NNP, VBN, or other POS tag) with respect to a target phrase?

Here is a sketch of how I might get surrounding words with respect to my target phrase.

sentence_List = ["Obviously one of the most important features of any computer is the human interface.",
                 "Good for everyday computing and web browsing.",
                 "My problem was with DELL Customer Service",
                 "I play a lot of casual games online[comma] and the touchpad is very responsive"]

target_phraseList = ["human interface", "everyday computing", "DELL Customer Service", "touchpad"]

Note that my original dataset was given as a dataframe in which the sentences and their respective target phrases were provided. Here I simulate the data as follows:

import pandas as pd
df=pd.Series(sentence_List, target_phraseList)
df=pd.DataFrame(df)

Here I tokenize the sentences as follows:

from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in sentence_List]
tokenized=[i for i in tokenized_sents]

Then I try to find the surrounding words with respect to my target phrases using the approach linked above. However, I want to find the relatively close or closest adjective, verb, or VBN with respect to my target phrase. How can I make this happen? Any idea how to get this done? Thanks

Hamilton asked Oct 29 '22

1 Answer

Would something like the following work for you? I recognize there are some tweaks that need to be made to make this fully useful (checking for upper/lower case; it will also return the word ahead in the sentence rather than the one behind if there is a tie) but hopefully it is useful enough to get you started:

import nltk
from nltk.tokenize import MWETokenizer

def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence using a full target phrase.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))

    # The MWETokenizer puts underscores to replace spaces, for some reason
    # So just identify what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]


def search(text_tag, tokenized_sentence, target_index):
    """
    Search for the part of speech (POS) nearest a target phrase of interest.
    """
    for i in range(len(tokenized_sentence)):
        # Walk outward from the target, checking the word ahead before the one behind
        ahead = target_index + i
        behind = target_index - i
        if ahead < len(tokenized_sentence) and tokenized_sentence[ahead][1] == text_tag:
            return tokenized_sentence[ahead][0]
        # Guard against negative indices, which would otherwise wrap around the list
        if behind >= 0 and tokenized_sentence[behind][1] == text_tag:
            return tokenized_sentence[behind][0]
    return None  # No token with the requested tag in this sentence

x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))

y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))
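As a side note on the underscore behavior flagged in the comment above, you can run MWETokenizer on its own (no tagger model needed) to see exactly the token that smart_tokenizer later looks up; this is just a sketch reusing one sentence from the example:

```python
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokenizer.add_mwe(("DELL", "Customer", "Service"))

# The three-word phrase is collapsed into a single underscore-joined token
tokens = tokenizer.tokenize("My problem was with DELL Customer Service".split())
print(tokens)
# -> ['My', 'problem', 'was', 'with', 'DELL_Customer_Service']
```

This is why smart_tokenizer searches for `target_phrase.replace(' ', '_')` when locating the target's index.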

Edit: I made some changes to address the issue of using an arbitrary-length target phrase, as you can see in the smart_tokenizer function. The key there is the nltk.tokenize.MWETokenizer class (for more info see: Python: Tokenizing with phrases). Hopefully this helps. As an aside, I would challenge the idea that spaCy is necessarily more elegant - at some point, someone has to write the code to get the work done. That will either be the spaCy devs, or you as you roll your own solution. Their API is rather complicated, so I'll leave that exercise to you.
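To make the outward ahead/behind walk in search easier to verify in isolation, here is a minimal sketch on a hand-tagged sentence. The tags are assigned manually (an assumption, so no NLTK model download is needed), and nearest_with_tag is a hypothetical helper mirroring the same tie-breaking rule (the word ahead wins):

```python
def nearest_with_tag(tagged, target_index, tag):
    """Return the word with the given POS tag nearest to target_index."""
    for i in range(len(tagged)):
        # Check the word i positions ahead before the one i positions behind
        ahead, behind = target_index + i, target_index - i
        if ahead < len(tagged) and tagged[ahead][1] == tag:
            return tagged[ahead][0]
        if behind >= 0 and tagged[behind][1] == tag:
            return tagged[behind][0]
    return None

# Hand-tagged version of "Good for everyday_computing and web browsing"
tagged = [("Good", "JJ"), ("for", "IN"), ("everyday_computing", "NN"),
          ("and", "CC"), ("web", "NN"), ("browsing", "NN")]

print(nearest_with_tag(tagged, 2, "JJ"))  # -> Good (two steps behind the target)
print(nearest_with_tag(tagged, 2, "CC"))  # -> and (one step ahead of the target)
```

The same walk generalizes to any Penn Treebank tag ('JJ', 'VBN', 'NNP', ...) once you have the real pos_tag output from smart_tokenizer.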

HFBrowning answered Nov 11 '22