
Parsing natural language ingredient quantities for recipes [closed]

Tags: regex, ruby, nlp

I'm building a Ruby recipe management application, and as part of it, I want to be able to parse ingredient quantities into a form I can compare and scale. I'm wondering what the best tools are for doing this.

I originally planned on a complex regex, then some code that converts human-readable numbers like two or five into integers, and finally code that converts, say, 1 cup and 3 teaspoons into some base measurement. I control the input, so I keep the actual ingredient separate. However, I noticed users entering abstract measurements like to taste and 1 package. For the abstract measurements, I think I could simply leave them unscaled and scrape any number preceding them.
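I was thinking the word-to-number step could be a simple lookup table plus a couple of regexes, something like this (a rough sketch; the table is obviously incomplete and the method name is just a placeholder):

```ruby
# Rough sketch of the word-to-number step. The lookup table is obviously
# incomplete; extend it as needed.
WORD_NUMBERS = {
  "one" => 1, "two" => 2, "three" => 3, "four" => 4, "five" => 5,
  "ten" => 10, "half" => Rational(1, 2)
}

def parse_amount(str)
  s = str.strip.downcase
  return Rational(*s.split("/").map(&:to_i)) if s =~ %r{\A\d+/\d+\z}  # "1/4"
  return s.to_f if s =~ /\A\d+(\.\d+)?\z/                             # "2", "0.75"
  WORD_NUMBERS[s]                                                     # "two" -> 2
end
```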

Here are some more examples:

1 tall can
1/4 cup
2 Leaves
1 packet
To Taste
One
Two slices
3-4 fillets
Half-bunch
2 to 3 pinches (optional)

Are there any tricks to this? I've noticed users seem somewhat confused about what constitutes a quantity. I could try to enforce stricter rules and push things like tall can and leaves into the ingredient part. However, to enforce that, I need to be able to convey what's invalid.

I'm also not sure what "base" measurement I should convert quantities into.

These are my goals:

  1. To be able to scale recipes. Arbitrary units of measurement like packages don't have to be scaled, but precise ones like cups or ounces do.

  2. Figure out the "main" ingredients. In the context of this question, this will be done largely by finding the largest ingredient in the recipe. In production, there will have to be some sort of modifier based on the type of ingredient, because flour is almost never considered the "main" ingredient, while chocolate can be used sparingly and the result can still be called a chocolate cake.

  3. Normalize input. To keep some consistency on the site, I want to keep consistent abbreviations. For example, instead of pounds, it should be lbs.

asked Sep 13 '12 by hadees

3 Answers

You pose two problems, recognizing/extracting the quantity expressions (syntax) and figuring out what amount they mean (semantics).

Before you figure out whether regexps are enough to recognize the quantities, you should make yourself a good schema (grammar) of what they look like. Your examples look like this:

<amount> <unit> [of <ingredient>]

where <amount> can take many forms:

whole or decimal number, in digits (250, 0.75)
common fraction (3/4)
numeral in words (half, one, ten, twenty-five, three quarters)
determiner instead of a numeral ("an onion")
subjective (some, a few, several)

The amount can also be expressed as a range of two simple <amount>s:

two to three
2 to 3
2-3
five to 10

Then you have the units themselves:

general-purpose measurements (lb, oz, kg, g; pounds, ounces, etc.)
cooking units (Tb, tsp)
informal units (a pinch, a dash)
container sizes (package, bunch, large can)
no unit at all, for countable ingredients (as in "three lemons")

Finally, there's a special case of expressions that can never be combined with either amounts or units, so they effectively function as a combination of both:

a little
to taste

I'd suggest approaching this as a small parser, which you can make as detailed or as rough as you need to. It shouldn't be too hard to write regexps for all of those, if that's your tool of choice, but as you see it's not just a question of textual substitution. Pull the parts out and represent each ingredient as a triple (amount, unit, ingredient). (For countables, use a special unit "pieces" or whatever; for "a little" and the like, I'd treat them as special units).
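A rough sketch of what such a parser could look like in Ruby (the patterns are illustrative, nowhere near complete, and all the names are mine):

```ruby
# Illustrative only: a tiny parser that pulls out [amount, unit, ingredient].
# Extend the alternations as the grammar grows.
AMOUNT = %r{
  \d+/\d+                                          # common fraction: 1/4
  | \d+(?:\.\d+)?(?:\s*(?:-|to)\s*\d+(?:\.\d+)?)?  # number or range: 2, 0.75, 3-4, 2 to 3
  | \b(?:one|two|three|half|a\ few|some)\b         # words
}xi
UNIT = /\b(?:cups?|tsp|tbsp|oz|lbs?|g|kg|pinch(?:es)?|slices?|cans?|packets?|packages?)\b/i

def parse_ingredient_line(line)
  amount = line[AMOUNT]
  unit   = line[UNIT]
  rest   = line.sub(AMOUNT, "").sub(UNIT, "").sub(/\bof\b/i, "").strip
  [amount, unit, rest.empty? ? nil : rest]
end
```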

That leaves the question of converting or comparing the quantities. Unit conversion has been done in lots of places, so at least for the official units you should have no trouble getting the conversion tables. Google will do it if you type "convert 4oz to grams", for example. Note that a Tbsp is either three or four tsp, depending on the country.

You can standardize to your favorite units pretty easily for well-defined units, but the informal units are a little trickier. For "a pinch", "a dash", and the like, I would suggest finding out the approximate weight so that you can scale properly (ten pinches = 2 grams, or whatever). Cans and the like are hopeless, unless you can look up the size of particular products.
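For the well-defined units this is just a lookup table; a sketch, with made-up estimates for the informal units (and note that cup/tbsp/tsp here assume a water-like density, which is wrong for flour and the like):

```ruby
# Approximate gram equivalents. The informal numbers are guesses you'd tune.
GRAMS_PER_UNIT = {
  "g" => 1.0, "kg" => 1000.0, "oz" => 28.35, "lb" => 453.6,
  "cup" => 240.0, "tbsp" => 15.0, "tsp" => 5.0,
  "pinch" => 0.2, "dash" => 0.6
}

def to_grams(amount, unit)
  factor = GRAMS_PER_UNIT[unit] or return nil  # cans, packages, etc.: give up
  amount * factor
end
```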

On the other hand, subjective amounts are the easiest: If you scale up "to taste" ten times, it's still "to taste"!
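In code that's just a pass-through for the subjective units (sketch; the SUBJECTIVE list is illustrative, and the pair shape follows the triple representation suggested above):

```ruby
# Scaling: multiply precise amounts, pass subjective quantities through as-is.
SUBJECTIVE = ["to taste", "a little", "some"].freeze

def scale_quantity(amount, unit, factor)
  return [amount, unit] if amount.nil? || SUBJECTIVE.include?(unit.to_s.downcase)
  [amount * factor, unit]
end
```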

One last thought: Some sort of database of ingredients is also needed for recognizing the main ingredients, since size matters: "One egg" is probably not the major ingredient, but "one small goat, quartered" may well be. I would consider it for version 2.

answered Oct 24 '22 by alexis


Regular expressions are difficult to get right for natural language parsing. NLTK, as you mentioned, would probably be a good option to look into; otherwise you'll find yourself going around in circles trying to get the expressions right.

If you want something of the Ruby variety instead of NLTK, take a look at Treat:

https://github.com/louismullie/treat

Also, the Linguistics framework might be a good option as well:

http://deveiate.org/projects/Linguistics

EDIT:

I figured there had to already be a Ruby recipe parser out there, here's another option you might want to look into:

https://github.com/iancanderson/ingreedy

answered Oct 24 '22 by Josh Voigts


There is a lot of free training data available out there if you know how to write a good web scraper and parsing tool.

http://allrecipes.com/Recipe/Darias-Slow-Cooker-Beef-Stroganoff - This site seems to let you convert recipe quantities based on metric/imperial system and number of diners.

http://www.epicurious.com/tools/conversions/common - This site seems to have lots of conversion constants.

Some systematic scraping of existing recipe sites that present ingredients and procedures in a structured format (which you can discover by reading the underlying HTML) will help you build up a really large training data set, which will make taking on such a problem much, much easier.

When you have tons of data, even simple learning techniques can be pretty useful, and standard NLP tricks (n-grams, tf-idf, naive Bayes, etc.) will let you do awesome things quickly.

For example:
Main ingredient-ness
Ingredients with a higher idf (inverse document frequency) are more likely to be main ingredients. Every dish mentions salt, so it should have a very low idf. Fewer dishes mention oil, so it should have a higher idf. Most dishes have only one main protein, so phrases like 'chicken' or 'tofu' should be rarer and much more likely to be main ingredients than salt, onions, oil, etc.

Of course there may be items like 'cilantro' that are rarer than 'chicken', but if you scraped some relevant metadata along with every dish, you will have signals that help fix this issue as well. Most chefs might not use cilantro in their recipes, but the ones that do probably use it quite a lot. So for any ingredient name, you can compute the name's idf by first restricting to the authors that have mentioned the ingredient at least once, and then computing the ingredient's idf on that subset of recipes.
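The idf part is a one-liner once each recipe is reduced to a list of ingredient names (toy sketch; the recipes data here is fabricated):

```ruby
# Toy idf: the fewer recipes mention an ingredient, the higher its score.
def idf(ingredient, recipes)
  containing = recipes.count { |r| r.include?(ingredient) }
  return 0.0 if containing.zero?
  Math.log(recipes.size.to_f / containing)
end

recipes = [
  %w[salt chicken oil],
  %w[salt flour sugar],
  %w[salt tofu oil]
]
# idf("salt", recipes) is 0 (it's in every recipe); "chicken" scores above "oil".
```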

Scaling recipes
Most recipe sites mention how many people a particular dish serves, and have a separate ingredients list with appropriate quantities for that number of people.
For any particular ingredient, you can collect all the recipes that mention it and see what quantity of the ingredient was prescribed for what number of people. This should tell you what phrases are used to describe quantities for that ingredient, and how the numbers scale. Also you can now collect all the ingredients whose quantities have been described using a particular phrase (e.g. 'slices' -> (bread, cheese, tofu,...), 'cup' -> (rice, flour, nuts, ...)) and look at the most common of these phrases and manually write down how they would scale.

Normalize Input
This does not seem like a hard problem at all. Manually curating a list of common abbreviations and their full forms (e.g. 'lbs' -> 'pounds', 'kgs' -> 'kilograms', 'oz' -> 'ounces', etc.) should solve 90% of the problem. Adding new contractions to this list whenever you see them should make it pretty comprehensive after a while.
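That list can literally be a hash applied with word-boundary substitutions (sketch; the pairs are the ones named above plus a few obvious guesses):

```ruby
# Expand common abbreviations with word-boundary substitutions.
# Order matters: "lbs" must be handled before "lb", "kgs" before "kg".
ABBREVIATIONS = {
  "lbs" => "pounds", "lb" => "pound",
  "kgs" => "kilograms", "kg" => "kilogram",
  "oz" => "ounces", "tsp" => "teaspoon", "tbsp" => "tablespoon"
}

def normalize_units(text)
  ABBREVIATIONS.reduce(text) do |t, (abbr, full)|
    t.gsub(/\b#{abbr}\b/i, full)
  end
end
```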

In summary, I am suggesting that you greatly increase the size of your data, collect lots of relevant metadata along with each recipe you scrape (author info, food genre, etc.), and use all this structured data along with simple NLP/ML tricks to solve most of the problems you will face while trying to build an intelligent recipe site.

answered Oct 24 '22 by Aditya Mukherji