I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:
POLYGON((51.124 -3.973, 51.1 -3.012, ....))
or
MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))
among other variations.
There is a good answer here: Parsing a WKT-file which uses the regex:
\d+(?:\.\d*)?
From other places I have also seen
\d*\.\d+|\d+
and
(\d*\.)?\d+
These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.
To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.
Short for regular expression, a regex is a string of text that lets you create patterns that help match, locate, and manage text. Perl is a great example of a programming language that utilizes regular expressions. However, its only one of the many places you can find regular expressions.
?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).
The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.
Special Regex Characters: These characters have special meaning in regex (to be discussed below): . , + , * , ? , ^ , $ , ( , ) , [ , ] , { , } , | , \ . Escape Sequences (\char): To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ).
It depends what number formats you need to allow, example:
format 1: 22 format 2: 22.2 format 3: .2 format 4: 2.
\d+(?:\.\d*)?
matches 1,2,4\d*\.\d+|\d+
matches 1,2,3(\d*\.)?\d+
matches 1,2,3 (and have an uneeded capturing group)Note: pattern 2 and 3 are slower to succeed than the first if the number is an integer, because they must match all digits until the dot, backtrack to the start and retry the same digits one more time. (see the schema below)
str | pattern | state -----+----------------+----------------------------- 123 | \d*\.\d+|\d+ | START 123 | \d*\.\d+|\d+ | OK 123 | \d*\.\d+|\d+ | OK 123 | \d*\.\d+|\d+ | OK 123 | \d*\.\d+|\d+ | FAIL => backtrack 123 | \d*\.\d+|\d+ | FAIL => backtrack 123 | \d*\.\d+|\d+ | FAIL => backtrack 123 | \d*\.\d+|\d+ | go to the next alternative 123 | \d*\.\d+|\d+ | OK 123 | \d*\.\d+|\d+ | OK 123 | \d*\.\d+|\d+ | OK => SUCCESS
if you want to match the four cases, you can use:
\.\d+|\d+(?:\.\d*)?
(+) if the number doesn't begin with a dot, the first alternative fails immediatly and the second alternative will match all other cases. The backtracking is limited to the minimum.
(-) if you have few numbers that start with a dot the first alternative will be tested and will fail each times. However, the first alternative fails quickly.(in other words, for the same reason). In this case, it is better to use \d+(?:\.\d*)?|\.\d+
Obviously, if you want to support negative values you need to add -?
:
-?(?:\.\d+|\d+(?:\.\d*)?)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With