 

Extracting food items from sentences

Tags:

algorithm

nlp

Given a sentence:

I had peanut butter and jelly sandwich and a cup of coffee for breakfast

I want to be able to extract the following food items from it:

peanut butter and jelly sandwich

coffee

So far, using POS tagging, I have been able to extract the individual food items, i.e.

peanut, butter, jelly, sandwich, coffee

But like I said, what I need is peanut butter and jelly sandwich instead of the individual items.

Is there some way of doing this without having a corpus or database of food items in the backend?

asked May 11 '17 by bigbong

2 Answers

You can attempt it with a trained set containing a corpus of food items, but the approach should work without one too.

Instead of doing simple POS tagging, do dependency parsing combined with POS tagging. That way you can find relations between multiple tokens of the phrase, and by parsing the dependency tree with restricted conditions, like noun-noun dependencies, you can find the relevant chunk.

You can use spaCy for dependency parsing. Here is the output from displaCy:

https://demos.explosion.ai/displacy/?text=peanut%20butter%20and%20jelly%20sandwich%20is%20delicious&model=en&cpu=1&cph=1

(displaCy dependency parse visualization of the example phrase)
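As an illustration, here is a minimal sketch of that step in code, assuming spaCy and its small English model (en_core_web_sm) are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I had peanut butter and jelly sandwich and a cup of coffee for breakfast")

# Token-level dependencies (this is what displaCy visualizes)
for token in doc:
    print(token.text, token.dep_, token.head.text, token.pos_)

# spaCy's base noun chunks group compound/adjective modifiers with their head
# noun; coordinated phrases like 'peanut butter and jelly sandwich' are still
# split across chunks and need to be merged by following conj/cc arcs.
for chunk in doc.noun_chunks:
    print(chunk.text)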

  • You can use freely available data, e.g. https://en.wikipedia.org/wiki/Lists_of_foods (or something better), as a training set to create a base set of food items (the hyperlinks in the crawled tree).
  • Based on the dependency parsing of your new data, you can keep enriching the base data. For example: if 'butter' exists in your corpus and 'peanut butter' is a frequently encountered pair of tokens, then 'peanut' and 'peanut butter' also get added to the corpus (see the sketch after this list).
  • The corpus can be maintained in a file that is loaded into memory while processing, or in a database like Redis, Aerospike, etc.
  • Make sure you work with normalized text, i.e. lower-cased, special characters cleaned, and words lemmatized/stemmed, in both the corpus and the data being processed. That would increase your coverage and accuracy.
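A rough sketch of that enrichment idea (the seed set and names below, like base_foods, are made up for illustration): if the head noun of a compound is already a known food, the compound phrase gets added as well.

import spacy

nlp = spacy.load("en_core_web_sm")
base_foods = {"butter", "sandwich", "coffee"}   # seeded e.g. from Wikipedia food lists

def enrich(text, foods):
    doc = nlp(text.lower())
    for token in doc:
        # e.g. 'peanut' --compound--> 'butter': if 'butter' is a known food,
        # record 'peanut butter' as a food phrase too
        if token.dep_ == "compound" and token.head.lemma_ in foods:
            foods.add(token.lemma_ + " " + token.head.lemma_)
    return foods

print(enrich("I had peanut butter and jelly sandwich and a cup of coffee", base_foods))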
answered Sep 27 '22 by DhruvPathak


First, extract all noun phrases using NLTK's chunking (code copied from here):

import nltk

# Chunk grammar: an NP is optional adjectives followed by nouns, and an NP
# joined to another NP by a coordinator (CC) such as 'and' is also an NP.
patterns = """
    NP: {<JJ>*<NN*>+}
    {<JJ>*<NN*><CC>*<NN*>+}
    {<NP><CC><NP>}
    {<RB><JJ>*<NN*>+}
    """

NPChunker = nltk.RegexpParser(patterns)


def prepare_text(text):
    # Split into sentences, tokenize, POS-tag, then chunk each sentence.
    sentences = nltk.sent_tokenize(text)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences


def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        # Chunking the already-chunked sentence a second time lets the
        # <NP><CC><NP> rule merge the NPs found in the first pass.
        tree = NPChunker.parse(sent)
        print(tree)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                nps.append(' '.join(word for word, tag in subtree.leaves()))
    return nps


def sent_parse(text):
    sentences = prepare_text(text)
    return parsed_text_to_NP(sentences)


if __name__ == '__main__':
    print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))

This will POS-tag your sentences and use a regex parser to extract noun phrases.

1. Define and refine your noun phrase regex

You'll need to change the patterns regex to define and refine your noun phrases. For example, {<NP><CC><NP>} tells the parser that an NP followed by a coordinator (CC) like "and" and another NP is itself an NP.
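One possible refinement (my own suggestion, not from the original answer, shown as a trimmed-down grammar): use <NN.*> rather than <NN*>, so plural and proper-noun tags (NNS, NNP, NNPS) also count as nouns.

import nltk

patterns = """
    NP: {<JJ>*<NN.*>+}    # optional adjectives followed by any noun tag (NN, NNS, NNP, ...)
    {<NP><CC><NP>}        # an NP, a coordinator like 'and', and another NP form an NP
    """
NPChunker = nltk.RegexpParser(patterns)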

2. Change from the NLTK POS tagger to the Stanford POS tagger

Also, I noted that NLTK's POS tagger is not performing very well (e.g. it considers "had peanut" as a verb phrase). You can change the POS tagger to the Stanford tagger if you want.
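A hedged sketch of that swap via NLTK's StanfordPOSTagger wrapper (the jar and model paths below are placeholders; you have to download them separately, and newer NLTK versions steer you towards the CoreNLP server instead):

from nltk.tag import StanfordPOSTagger

# Placeholder paths pointing at the Stanford POS tagger download
st = StanfordPOSTagger('models/english-bidirectional-distsim.tagger',
                       'stanford-postagger.jar')

# Drop-in replacement for nltk.pos_tag on an already tokenized sentence
print(st.tag('I ate peanut butter and beef burger .'.split()))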

3. Remove smaller noun phrases

After you have extracted all the noun phrases for a sentence, you can remove the ones that are part of a bigger noun phrase. For example, in the output below, beef burger and peanut butter should be removed because they're part of the bigger noun phrase peanut butter and beef burger.
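For example, a small helper that keeps only the maximal noun phrases (a plain substring check over the chunker output):

def keep_maximal(nps):
    # Drop any NP that appears inside a longer extracted NP
    return [np for np in nps
            if not any(np != other and np in other for other in nps)]

nps = ['peanut butter and beef burger', 'peanut butter', 'beef burger',
       'cup', 'coffee', 'breakfast']
print(keep_maximal(nps))
# -> ['peanut butter and beef burger', 'cup', 'coffee', 'breakfast']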

4. Remove noun phrases in which none of the words are in a food lexicon

You will get noun phrases like school bus. If neither school nor bus is in a food lexicon that you can compile from Wikipedia or WordNet, then remove the noun phrase. In this case, cup and breakfast would also be removed because, hopefully, they're not in your food lexicon.
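A sketch of such a check against WordNet (assumes NLTK's wordnet corpus has been downloaded; the food synsets chosen here are just one reasonable starting point):

from nltk.corpus import wordnet as wn

FOOD = {wn.synset('food.n.01'), wn.synset('food.n.02')}

def is_food_word(word):
    # A word counts as food-related if any of its noun senses has
    # 'food' somewhere among its hypernyms
    for syn in wn.synsets(word, pos=wn.NOUN):
        hypernyms = {h for path in syn.hypernym_paths() for h in path}
        if FOOD & hypernyms:
            return True
    return False

def filter_food_nps(nps):
    # Keep a noun phrase if at least one of its words looks like food
    return [np for np in nps if any(is_food_word(w) for w in np.split())]

print(filter_food_nps(['peanut butter and beef burger', 'coffee', 'school bus']))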

The current code returns

['peanut butter and beef burger', 'peanut butter', 'beef burger', 'cup', 'coffee', 'breakfast']

for input

print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
answered Sep 27 '22 by Ash