I am doing sentiment analysis on a set of documents, and my goal is to find the closest or surrounding adjectives with respect to a target phrase in each sentence. I have an idea of how to extract the surrounding words with respect to a target phrase, but how do I find the relatively close or closest adjective, NNP, VBN, or other POS tag with respect to the target phrase?
Here is a sketch of how I might get the surrounding words with respect to my target phrase.
sentence_List= {"Obviously one of the most important features of any computer is the human interface.", "Good for everyday computing and web browsing.",
"My problem was with DELL Customer Service", "I play a lot of casual games online[comma] and the touchpad is very responsive"}
target_phraseList={"human interface","everyday computing","DELL Customer Service","touchpad"}
Note that my original dataset was given as a dataframe where the list of sentences and the respective target phrases were provided. Here I have just simulated the data as follows:
import pandas as pd

# sentences as values, target phrases as the index
df = pd.Series(sentence_List, index=target_phraseList)
df = pd.DataFrame(df)
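If it helps, the same simulated data could also be laid out as an explicit two-column frame (just a sketch; the column names sentence and target_phrase are only illustrative):

import pandas as pd

# hypothetical layout: one row per (sentence, target phrase) pair
df = pd.DataFrame({"sentence": sentence_List,
                   "target_phrase": target_phraseList})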
Here I tokenize the sentences as follows:
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in sentence_List]
tokenized=[i for i in tokenized_sents]
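For reference, POS tags for each tokenized sentence can then be obtained with nltk.pos_tag (a minimal sketch; it assumes the averaged_perceptron_tagger resource has already been downloaded):

import nltk

# each sentence becomes a list of (word, POS) tuples
tagged_sents = [nltk.pos_tag(tokens) for tokens in tokenized_sents]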
Then I try to find the surrounding words with respect to my target phrases using the approach shown here. However, I want to find the relatively closer or closest adjective, verb, or VBN with respect to my target phrase. How can I make this happen? Any ideas on how to get this done? Thanks.
Would something like the following work for you? I recognize there are some tweaks that need to be made to make this fully useful (checking for upper/lower case; it will also return the word ahead in the sentence rather than the one behind if there is a tie), but hopefully it is useful enough to get you started:
import nltk
from nltk.tokenize import MWETokenizer

def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence, keeping the full target phrase as a single token.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))

    # The MWETokenizer puts underscores to replace spaces, for some reason
    # So just identify what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]
def search(text_tag, tokenized_sentence, target_index):
    """
    Search for the part of speech (POS) nearest a target phrase of interest.
    """
    for i in range(len(tokenized_sentence)):
        # Each entry is a (word, POS) tuple; check i positions ahead of the
        # target first, then i positions behind it
        ahead = target_index + i
        behind = target_index - i
        if ahead < len(tokenized_sentence) and tokenized_sentence[ahead][1] == text_tag:
            return tokenized_sentence[ahead][0]
        if behind >= 0 and tokenized_sentence[behind][1] == text_tag:
            return tokenized_sentence[behind][0]
# Nearest noun (NN) to "DELL Customer Service"
x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))

# Nearest noun (NN) to "everyday computing"
y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))
Edit: I made some changes to handle an arbitrary-length target phrase, as you can see in the smart_tokenizer function. The key there is the nltk.tokenize.MWETokenizer class (for more info see: Python: Tokenizing with phrases). Hopefully this helps. As an aside, I would challenge the idea that spaCy is necessarily more elegant; at some point, someone has to write the code to get the work done. That will either be the spaCy devs, or you as you roll your own solution. Their API is rather involved, so I'll leave that exercise to you.
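That said, if you do decide to try spaCy, a rough sketch of the same idea might look like the following. This assumes spaCy v3 with the en_core_web_sm model installed; the helper name nearest_pos is just illustrative, not part of spaCy's API:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

def nearest_pos(sentence, target_phrase, pos="ADJ"):
    """Return the token with the given coarse POS tag closest to the target phrase."""
    doc = nlp(sentence)
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("TARGET", [nlp(target_phrase)])
    matches = matcher(doc)
    if not matches:
        return None
    _, start, end = matches[0]
    # Candidate tokens with the requested POS that lie outside the phrase span
    candidates = [t for t in doc if t.pos_ == pos and not (start <= t.i < end)]
    if not candidates:
        return None
    # Closest token by distance to either edge of the phrase span
    return min(candidates, key=lambda t: min(abs(t.i - start), abs(t.i - (end - 1)))).text

print(nearest_pos("Good for everyday computing and web browsing.", "everyday computing"))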