
Any way to improve this regular expression?

I'm kinda a newbie at regular expressions, so would appreciate a bit of peer feedback on this one. It will be heavily used on my site, so any weird edge cases can totally wreak havoc. The idea is to type in an amount of an ingredient in a recipe in whole units or fractions. Due to my autocomplete mechanism, just a number is valid too (since it'll pop up a dropdown). These lines are valid:

1
1/2
1 1/2
4 cups
4 1/2 cups
10 3/4 cups sliced

The numeric part of the line should be its own group so I can parse that with my fraction parser. Everything after the numeric part should be a second group. At first, I tried this:

^\s*(\d+|\d+\/\d+|\d+\s*\d+\/\d+)\s*(.*)$

This almost works, but "1 1/2 cups" will get parsed as (1) (1/2 cups) instead of (1 1/2) and (cups). After scratching my head a bit, I determined this was because of the ordering of my "OR" clause. (1) satisfies the \d+ and (.*) satisfies the rest. So I changed this to:

^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*(.*)$
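To make the ordering effect concrete, here is a minimal Python sketch (Python is my choice for illustration; the question does not name a language). It assumes both patterns keep the catch-all `(.*)` tail from the first attempt, so that only the order of the alternatives differs. The names `NUMBER_FIRST`, `NUMBER_LAST` and `groups` are made up for this example.

```python
import re

# Same catch-all tail for both patterns so only the alternation order differs.
TAIL = r"\s*(.*)$"

# Bare \d+ listed first: it wins as soon as it can match.
NUMBER_FIRST = re.compile(r"^\s*(\d+|\d+\/\d+|\d+\s*\d+\/\d+)" + TAIL)

# Bare \d+ listed last: the fraction forms get tried before it.
NUMBER_LAST = re.compile(r"^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)" + TAIL)


def groups(pattern, line):
    """Return the (numeric, remainder) groups, or None when nothing matches."""
    m = pattern.match(line)
    return m.groups() if m else None


print(groups(NUMBER_FIRST, "1 1/2 cups"))  # ('1', '1/2 cups')
print(groups(NUMBER_LAST, "1 1/2 cups"))   # ('1 1/2', 'cups')
```

Alternatives are tried left to right, and the engine keeps the first one that lets the rest of the pattern succeed, which is why the plain number has to come last.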

This almost works, but allows weirdness such as "1 1/2/4 cups" or "1/2 3 cups". So I decided to enforce a letter as the first character after a valid numeric expression:

^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*($|[a-z].*)$
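As a quick sanity check, the pattern above can be run against the sample lines from the question plus the two "weird" inputs. This is only a sketch in Python (the question does not say which regex engine is used), with the case-insensitive flag set as the question describes; `AMOUNT` and `parse` are illustrative names.

```python
import re

# The final pattern from the question, compiled case-insensitively.
AMOUNT = re.compile(
    r"^\s*(\d+\/\d+|\d+\s*\d+\/\d+|\d+)\s*($|[a-z].*)$",
    re.IGNORECASE,
)


def parse(line):
    """Return (numeric part, remainder), or None if the line is rejected."""
    m = AMOUNT.match(line)
    return m.groups() if m else None


valid = ["1", "1/2", "1 1/2", "4 cups", "4 1/2 cups", "10 3/4 cups sliced"]
weird = ["1 1/2/4 cups", "1/2 3 cups"]

for line in valid + weird:
    print(repr(line), "->", parse(line))
```

In this sketch every valid line yields two groups (the second is an empty string when only a number is typed), and both weird inputs come back as `None`.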

Note I'm running this in case-insensitive mode. Here are my questions:

  1. Can the expression be improved? I kinda don't like the "OR" list for number, fraction, compound fraction, but I couldn't think of another way to allow whole numbers, fractions, or compound fractions.

  2. It would be extra nice if I could return a group for each word after the numeric component. Such as a group for (10 3/4), a group for (cups) and a group for (sliced). There can be any number of words after. Is this possible?

Thanks!

Mike Christensen asked Aug 23 '10 01:08


2 Answers

Well, it appears to me that you don't need OR conditions at all (but see below).

For the numeric bit, you could get away with:

\d+(\s+\d+/\d+)

which would handle compound fractional values such as 1 1/2 (as written, both the whole number and the fraction are required).

I would still keep your decimal separate with an OR clause since it's likely to complicate things. So I think you could probably get away with something like:

^\s*((\d+\s)?(\d+/\d+)?|\d+(\.\d+)?)\s*([a-z].*)?$
 |   |                  |           |  |
 |   |                  |           |  +--- start of alpha section.
 |   |                  |           +------ optional white space.
 |   |                  +------------------ decimal (nn[.nn])
 |   +------------------------------------- fractional ([nn ][nn/nn])
 +----------------------------------------- optional starting space.

although that allows for an empty fractional amount so you may be better off with what you've got (whole, fractional and decimal in separate OR clauses).
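The empty-amount caveat is easy to see by running the combined pattern. The following is a rough Python sketch, assuming case-insensitive matching as in the question; `PAX` and `amount_and_rest` are names invented for the example.

```python
import re

# The combined pattern from above. Group 1 is the whole numeric part,
# group 5 is the optional alpha section.
PAX = re.compile(
    r"^\s*((\d+\s)?(\d+/\d+)?|\d+(\.\d+)?)\s*([a-z].*)?$",
    re.IGNORECASE,
)


def amount_and_rest(line):
    """Return (numeric part, alpha part), or None if the line is rejected."""
    m = PAX.match(line)
    if m is None:
        return None
    return m.group(1), m.group(5)


print(amount_and_rest("4 1/2 cups"))  # ('4 1/2', 'cups')
print(amount_and_rest("1.5 cups"))    # ('1.5', 'cups')
print(amount_and_rest("cups"))        # ('', 'cups'): the amount is empty
```

Because both halves of the fractional alternative are optional, a line with no number at all still matches, so the caller would need to reject an empty first group separately. The first group can also carry a trailing space (for example with "4 cups"), so stripping it before further parsing seems prudent.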

I prefer the ([a-z].*)?$ construct to ($|[a-z].*)$ myself, but that may just be an aversion on my part to having multiple line-end markers in my RE :-)


But, in all honesty, I think you may be trying to swat a fly with a thermo-nuclear warhead here.

Do you really need to restrict what gets entered? I've seen recipes that call for a pinch of salt and a handful of sultanas. I personally think you may be being too restrictive in what you'll allow. I would have a free-form field for quantity and a drop-down for food type (actually, I would probably just allow free-form for the lot unless I was offering the ability to search for recipes based on what's in the fridge).

paxdiablo answered Oct 07 '22 14:10


I believe that this regex should do what you want:

/^\s*(\d+ \d+\/\d+|\d+\/\d+|\d+)\s*(.*)/

For matching the specific words you should just do a split on whitespace after the parsing. There are some things you don't want to do with regexes ;)
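A minimal sketch of that two-step approach, written in Python purely for illustration (the question does not name a language). `WOLPH` and `parse_ingredient` are hypothetical names; the pattern is the one given above without the surrounding slashes.

```python
import re

# Step 1: the regex only separates the amount from everything after it.
WOLPH = re.compile(r"^\s*(\d+ \d+\/\d+|\d+\/\d+|\d+)\s*(.*)")


def parse_ingredient(line):
    """Return (amount, list of words), or None if no amount is found."""
    m = WOLPH.match(line)
    if m is None:
        return None
    amount, rest = m.groups()
    # Step 2: a plain whitespace split yields one entry per word.
    return amount, rest.split()


print(parse_ingredient("10 3/4 cups sliced"))  # ('10 3/4', ['cups', 'sliced'])
print(parse_ingredient("1"))                   # ('1', [])
```

Splitting afterwards sidesteps the fact that a repeated capture group in most regex engines keeps only its last match, so "any number of words" is awkward to express as separate groups.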

Wolph answered Oct 07 '22 15:10