
Regex to get measurements

I have these measurements in a document:

5.3 x 2.5 cm
11 x 11 mm
7 mm 
13 x 12 x 14 mm
13x12cm

I need to extract 5.3 x 2.5 cm using a regex in Python.

My code so far is below, but it does not work properly:

import re

# `text` is the document being searched (not shown here).
x = r"\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?"
by = r"( )?(by|x)( )?"
cm = r"(mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "((" + x + r" *(to|\-) *" + cm + ")" + "|(" + x + cm + "))"
xy_cm = "((" + x + cm + by + x + cm + ")" +"|(" + x + by + x + cm + ")" +"|(" + x + by + x + "))"
xyz_cm = "((" + x + cm + by + x + cm + by + x + cm + ")" + "|(" + x + by + x + by + x + cm + ")" + "|(" + x + by + x + by + x + "))"
m = "((" + xyz_cm + ")" + "|(" + xy_cm + ")" + "|(" + x_cm + "))"
a = re.compile(m)
print(a.findall(text))

The output it gives:

[('13', '13', '13', '13', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('12', '12', '12', '12', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('4', '4', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('25', '25', '25', '25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''),
user1979556 asked Sep 02 '17 07:09


1 Answer

With regex, you should always build up your expression step by step until it matches what you want. For example:

s = "5.3 x 2.5 cm"

You want to find the numbers here?

re.findall(r"\d+", s)

gives you every run of digits:

["5", "3", "2", "5"]

OK, so what if your numbers can be decimals but don't have to be? Then you extend your expression with an optional non-capturing group that matches a dot and any digits following it.

re.findall(r"\d+(?:\.\d*)?", s)

this gives you

["5.3", "2.5"]

Then you can match the x separator with any amount of whitespace around it:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", s)

Putting the numbers in capturing groups now gives you one tuple per match:

[("5.3", "2.5")]

You can then go on with the units:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", s)

giving you the tuple you want:

[("5.3", "2.5", "cm")]

and so on.
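
As one possible continuation of these steps, the sketch below handles every measurement listed in the question: one to three numbers separated by x, followed by cm or mm. The names NUMBER, SEPARATOR, UNIT, MEASUREMENT and find_measurements are made up for illustration, and the sample text is assumed to be the list from the question.

```python
import re

# Assumed sample text, copied from the measurements listed in the question.
text = """5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm"""

NUMBER = r"\d+(?:\.\d*)?"   # integer or decimal, as built up above
SEPARATOR = r"\s*x\s*"      # an "x" with optional whitespace around it
UNIT = r"(?:cm|mm)"         # only the two short unit names are assumed

# One number, up to two more "x number" parts, then the unit.
MEASUREMENT = re.compile(
    rf"({NUMBER}(?:{SEPARATOR}{NUMBER}){{0,2}})\s*({UNIT})"
)

def find_measurements(s):
    """Return a (list of number strings, unit) pair per measurement in s."""
    results = []
    for dims, unit in MEASUREMENT.findall(s):
        # Split the matched dimension part back into its numbers.
        results.append((re.findall(NUMBER, dims), unit))
    return results

for numbers, unit in find_measurements(text):
    print(numbers, unit)
```

Capturing the whole dimension part as one group and splitting it afterwards avoids the long list of empty groups that findall returns for the pattern in the question.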

If you build your regexes like this, you can see what breaks from one change to the next. Debugging a huge regex like the one you posted is not worth the effort.
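
One way to keep track of what breaks, sketched here with the patterns and results from the steps above, is to store each stage together with its expected output and re-run all of them after every change. The names STEPS and check_steps are hypothetical.

```python
import re

s = "5.3 x 2.5 cm"

# Each stage of the regex paired with the result it should give for s.
STEPS = [
    (r"\d+", ["5", "3", "2", "5"]),
    (r"\d+(?:\.\d*)?", ["5.3", "2.5"]),
    (r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", [("5.3", "2.5")]),
    (r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", [("5.3", "2.5", "cm")]),
]

def check_steps(steps, sample):
    """Return the patterns whose findall result differs from the expectation."""
    return [p for p, expected in steps if re.findall(p, sample) != expected]

failed = check_steps(STEPS, s)
print(failed)  # an empty list means every stage still behaves as expected
```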

I wouldn't name the unit regex cm; that is quite confusing for anyone maintaining your code in the future. Apart from that, you need clear requirements for the number formats you want to allow. Somebody might enter scientific notation, for example, and then your regexes will become very complicated.
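
To illustrate both points, here is a minimal sketch with more descriptive names and a number pattern that also accepts an exponent. NUMBER_PATTERN, UNIT_PATTERN and LENGTH are hypothetical names, and support for scientific notation is an assumed requirement, not something the question asks for.

```python
import re

# Hypothetical, more descriptive names than x and cm in the question.
NUMBER_PATTERN = r"""
    \d+ (?:\.\d*)?        # integer or decimal, e.g. 7 or 5.3
    (?:[eE][+-]?\d+)?     # optional exponent, e.g. 1.5e2 (assumed requirement)
"""
UNIT_PATTERN = r"millimeters?|centimeters?|mm|cm"

# re.VERBOSE lets the pattern carry whitespace and comments.
LENGTH = re.compile(
    rf"(?P<value>{NUMBER_PATTERN})\s*(?P<unit>{UNIT_PATTERN})\b",
    re.VERBOSE,
)

for match in LENGTH.finditer("about 1.5e2 mm wide and 7 centimeters long"):
    print(match.group("value"), match.group("unit"))
```

Named groups make the results readable without counting parentheses, which matters more as the number format grows.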

CodeMonkey answered Oct 14 '22 23:10