I have these measurements in the document
5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm
I need to extract 5.3 x 2.5 cm from it using a regex in Python.
This is my code so far, but it does not work properly:
import re

# text holds the document contents shown above
x = r"\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?"
by = r"( )?(by|x)( )?"
cm = r"(mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "((" + x + r" *(to|\-) *" + cm + ")" + "|(" + x + cm + "))"
xy_cm = "((" + x + cm + by + x + cm + ")" +"|(" + x + by + x + cm + ")" +"|(" + x + by + x + "))"
xyz_cm = "((" + x + cm + by + x + cm + by + x + cm + ")" + "|(" + x + by + x + by + x + cm + ")" + "|(" + x + by + x + by + x + "))"
m = "((" + xyz_cm + ")" + "|(" + xy_cm + ")" + "|(" + x_cm + "))"
a = re.compile(m)
print(a.findall(text))
The output it gives:
[('13', '13', '13', '13', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('12', '12', '12', '12', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('4', '4', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('25', '25', '25', '25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''),
With regex, you should always build up your expression step by step until it gets what you want. For example:
s = "5.3 x 2.5 cm"
You want to find the numbers here?
re.findall(r"\d+", s)
gives you all the integers:
["5", "3", "2", "5"]
OK, so what if your numbers can be floating point but don't have to be? Then you extend the expression with an optional non-capturing group that matches a dot and any digits following it.
re.findall(r"\d+(?:\.\d*)?", s)
this gives you
["5.3", "2.5"]
Then you can add the multiplication sign, with an arbitrary amount of whitespace around it:
re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", s)
Putting the numbers in capturing groups now gives you a tuple per match:
[("5.3", "2.5")]
You can then go on with the units:
re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", s)
giving you the tuple you want:
[("5.3", "2.5", "cm")]
and so on.
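Continuing in the same way, here is a rough sketch of one more step that covers every line from the question. It assumes that a measurement is one to three numbers separated by x and followed by mm or cm; the names NUMBER, UNIT and MEASUREMENT are made up for this example. All groups are non-capturing, so findall returns the whole matched text instead of tuples.

```python
import re

text = """5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm"""

NUMBER = r"\d+(?:\.\d*)?"   # integer or decimal, as built up above
UNIT = r"(?:cm|mm)"         # non-capturing, so findall returns whole matches

# One number, then up to two more "x number" parts, then the unit.
MEASUREMENT = re.compile(rf"{NUMBER}(?:\s*x\s*{NUMBER}){{0,2}}\s*{UNIT}")

matches = MEASUREMENT.findall(text)
print(matches)
# ['5.3 x 2.5 cm', '11 x 11 mm', '7 mm', '13 x 12 x 14 mm', '13x12cm']
```

If you need the individual numbers rather than the whole string, you can run the simple number pattern over each match afterwards.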
If you build your regexes like this, you can see what breaks from one change to the next. Debugging a huge regex like the one you posted is not worth the effort.
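One likely reason the posted output is so hard to read: every plain parenthesis in a pattern is a capturing group, and re.findall returns one tuple slot per group, filling the groups that did not take part in the match with empty strings. A small sketch of the difference (the variable names are just for illustration):

```python
import re

s = "5.3 x 2.5 cm"

# Capturing groups: findall returns one tuple per match, one slot per group.
capturing = re.findall(r"(\d+(\.\d*)?)\s*(cm|mm)", s)
print(capturing)        # [('2.5', '.5', 'cm')]

# Non-capturing groups (?:...): findall returns the whole matched text.
non_capturing = re.findall(r"\d+(?:\.\d*)?\s*(?:cm|mm)", s)
print(non_capturing)    # ['2.5 cm']
```

Switching groups that exist only for grouping to (?:...) keeps the result down to the pieces you actually want.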
I also wouldn't name the unit regex cm; that's quite confusing for anyone maintaining your code in the future. Apart from that, you need clear requirements on the number formats you want to allow. Maybe somebody will enter scientific notation, for example, and then your regexes will become very complicated.
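As an illustration of both points, here is a sketch with more descriptive names and a number pattern that also accepts a leading dot and an exponent. Which number formats to accept is purely an assumption here, and the names NUMBER, LENGTH_UNIT, SEPARATOR and DIMENSIONS are invented for the example; your real requirements should decide the details.

```python
import re

# Assumed formats: 12, 12.5, .5 and an optional exponent such as 1.5e1.
NUMBER = r"(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?"
# Named for what it matches, not for one of the units it contains.
LENGTH_UNIT = r"(?:millimeters?|centimeters?|mm|cm)"
SEPARATOR = r"\s*(?:x|by)\s*"

DIMENSIONS = re.compile(
    rf"{NUMBER}(?:{SEPARATOR}{NUMBER}){{0,2}}\s*{LENGTH_UNIT}"
)

samples = "1.5e1 x .5 cm, 13 by 12 millimeters"
found = DIMENSIONS.findall(samples)
print(found)
# ['1.5e1 x .5 cm', '13 by 12 millimeters']
```

Even this small extension makes the number pattern noticeably harder to read, which is why it pays to settle the requirements first.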