Given a sentence:
I had peanut butter and jelly sandwich and a cup of coffee for breakfast
I want to be able to extract the following food items from it:
peanut butter and jelly sandwich
coffee
So far, using POS tagging, I have been able to extract the individual food items, i.e.
peanut, butter, jelly, sandwich, coffee
But like I said, what I need is peanut butter and jelly sandwich instead of the individual items.
Is there some way of doing this without having a corpus or database of food items in the backend?
You can attempt it with a model trained on a corpus of food items, but the approach should work without one too.
Instead of doing simple POS tagging, combine dependency parsing with POS tagging. That way you can find the relations between the tokens of a phrase, and by walking the dependency tree under restricted conditions, such as noun-noun dependencies, you should be able to find the relevant chunk.
You can use spaCy for dependency parsing. Here is the output from displaCy:
https://demos.explosion.ai/displacy/?text=peanut%20butter%20and%20jelly%20sandwich%20is%20delicious&model=en&cpu=1&cph=1
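To make the idea of "restricted conditions" concrete, here is a minimal sketch that needs no parser installed. The `Token` tuple, the `noun_groups` helper, and the parse in `TOKENS` are all assumptions made for illustration: the parse is written by hand in a spaCy-like label style and is not real parser output, so an actual parser may attach the tokens differently. The sketch merges every token that hangs off a noun through a `compound`, `conj`, or `cc` relation into that noun's group.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    pos: str   # coarse POS tag, e.g. "NOUN"
    dep: str   # dependency label linking this token to its head
    head: int  # index of the head token in the sentence


# Hand-written, illustrative parse of
# "peanut butter and jelly sandwich is delicious" (NOT real parser output).
TOKENS = [
    Token("peanut", "NOUN", "compound", 1),
    Token("butter", "NOUN", "compound", 4),
    Token("and", "CCONJ", "cc", 1),
    Token("jelly", "NOUN", "conj", 1),
    Token("sandwich", "NOUN", "nsubj", 5),
    Token("is", "AUX", "ROOT", 5),
    Token("delicious", "ADJ", "acomp", 5),
]

# Relations we allow when merging tokens into one noun group (an assumption).
ALLOWED = {"compound", "conj", "cc"}


def noun_groups(tokens):
    """Group tokens that attach to a noun via an allowed relation."""

    def group_root(i):
        # Climb towards the head while the link is allowed and the head is a noun.
        while tokens[i].dep in ALLOWED and tokens[tokens[i].head].pos == "NOUN":
            i = tokens[i].head
        return i

    groups = {}
    for i in range(len(tokens)):
        groups.setdefault(group_root(i), []).append(i)

    # Keep only groups whose root is a noun; join the words in sentence order.
    return [
        " ".join(tokens[i].text for i in members)
        for root, members in groups.items()
        if tokens[root].pos == "NOUN"
    ]


print(noun_groups(TOKENS))
```

Note that a `conj` link can also join two separate food items (for example a sandwich and a cup of coffee), so in practice the conditions would probably need tightening.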
First extract all Noun phrases using NLTK's Chunking (code copied from here):
import nltk

# Chunk grammar: each {...} rule marks a sequence of POS tags as a noun phrase (NP).
# <NN.*> matches NN, NNS, NNP and NNPS (a bare <NN*> would only match NN).
patterns = """
    NP: {<JJ>*<NN.*>+}
    {<JJ>*<NN.*><CC>*<NN.*>+}
    {<NP><CC><NP>}
    {<RB><JJ>*<NN.*>+}
    """

NPChunker = nltk.RegexpParser(patterns)


def prepare_text(text):
    # Split into sentences, tokenize, POS-tag, then run a first chunking pass.
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        # Second pass: the tree already contains NP chunks, so the
        # {<NP><CC><NP>} rule can now combine neighbouring NPs.
        tree = NPChunker.parse(sent)
        print(tree)  # debug output: shows the chunked tree
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                nps.append(' '.join(word for word, tag in subtree.leaves()))
    return nps


def sent_parse(text):
    sentences = prepare_text(text)
    return parsed_text_to_NP(sentences)


if __name__ == '__main__':
    print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
This will POS-tag your sentences and use a regex parser to extract noun phrases.
1. Define and refine your noun phrase regex
You'll need to change the patterns regex to define and refine your noun phrases. For example, {<NP><CC><NP>} tells the parser that an NP followed by a coordinator (CC) such as "and" and another NP is itself an NP.
2. Change from the NLTK POS tagger to the Stanford POS tagger
I also noticed that NLTK's POS tagger does not perform very well here (e.g. it treats "had peanut" as a verb phrase). You can switch to the Stanford POS tagger if you want.
3. Remove smaller noun phrases
After you have extracted all the noun phrases for a sentence, you can remove the ones that are part of a bigger noun phrase. In the following example, beef burger and peanut butter should be removed because they are part of the bigger noun phrase peanut butter and beef burger.
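One possible way to do this filtering is sketched below in plain Python. The helper names `contains_words` and `drop_nested_phrases` are made up for this illustration. The comparison is done on whole words rather than on raw substrings, so that a phrase such as "cup" would not be dropped just because "cupcake" appears elsewhere.

```python
def contains_words(big, small):
    """True if the word list `small` occurs contiguously inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def drop_nested_phrases(phrases):
    """Keep only phrases that are not part of another, longer phrase."""
    kept = []
    for phrase in phrases:
        words = phrase.split()
        nested = any(
            other != phrase and contains_words(other.split(), words)
            for other in phrases
        )
        if not nested:
            kept.append(phrase)
    return kept


noun_phrases = ['peanut butter and beef burger', 'peanut butter',
                'beef burger', 'cup', 'coffee', 'breakfast']
print(drop_nested_phrases(noun_phrases))
# ['peanut butter and beef burger', 'cup', 'coffee', 'breakfast']
```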
4. Remove noun phrases in which none of the words are in a food lexicon
You will get noun phrases like school bus. If neither school nor bus is in a food lexicon, which you can compile from Wikipedia or WordNet, remove the noun phrase. In this case, remove cup and breakfast, because they are presumably not in your food lexicon.
The current code returns
['peanut butter and beef burger', 'peanut butter', 'beef burger', 'cup', 'coffee', 'breakfast']
for input
print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
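The lexicon check from step 4 could look roughly like the sketch below. `FOOD_LEXICON` here is a tiny hand-picked set that exists only for this illustration, and `keep_food_phrases` is a hypothetical helper name. A real lexicon would be compiled from a source such as Wikipedia or WordNet, as described above.

```python
# Tiny, hand-picked lexicon for illustration only (an assumption, not real data).
FOOD_LEXICON = {"peanut", "butter", "beef", "burger", "jelly", "sandwich", "coffee"}


def keep_food_phrases(phrases, lexicon):
    """Keep phrases in which at least one word appears in the lexicon."""
    return [
        phrase for phrase in phrases
        if any(word.lower() in lexicon for word in phrase.split())
    ]


candidates = ['peanut butter and beef burger', 'cup', 'coffee', 'breakfast']
print(keep_food_phrases(candidates, FOOD_LEXICON))
# ['peanut butter and beef burger', 'coffee']
```

Requiring only one matching word keeps a phrase like "peanut butter and beef burger" even though "and" is not a food word. The trade-off is that a non-food phrase containing a single food word would also pass.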