Regex for well-known text

Tags:

regex

wkt

I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:

POLYGON((51.124 -3.973, 51.1 -3.012, ....))

MULTIPOLYGON(((51.124 -3.973, 51.1 -3.012, ....)),((50.14 -13.973, 51.1 -13.012, ....)))

among other variations.

There is a good answer in the question "Parsing a WKT-file", which uses the regex:

\d+(?:\.\d*)?

From other places I have also seen

\d*\.\d+|\d+

and

(\d*\.)?\d+

These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.

To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.

asked Mar 12 '14 15:03

John Powell

1 Answer

It depends on which number formats you need to allow. For example:

 format 1:   22
 format 2:   22.2
 format 3:   .2
 format 4:   2.

the 1st pattern \d+(?:\.\d*)? matches formats 1, 2 and 4 in full
the 2nd pattern \d*\.\d+|\d+ matches formats 1, 2 and 3 in full
the 3rd pattern (\d*\.)?\d+ matches formats 1, 2 and 3 in full (and has an unneeded capturing group)
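
The list above can be checked with a short script. This is a minimal sketch in Python (the language is my choice, not something the question specifies); `samples`, `patterns` and `accepted_formats` are names made up for the illustration. It uses `re.fullmatch`, so a pattern counts as matching a format only if it consumes the whole string.

```python
import re

# The four number formats from the list above, keyed by format number.
samples = {1: "22", 2: "22.2", 3: ".2", 4: "2."}

# The three patterns under comparison.
patterns = [
    r"\d+(?:\.\d*)?",
    r"\d*\.\d+|\d+",
    r"(\d*\.)?\d+",
]

def accepted_formats(pattern):
    """Return the format numbers that the pattern matches in full."""
    return [n for n, text in samples.items() if re.fullmatch(pattern, text)]

results = {p: accepted_formats(p) for p in patterns}
for p, formats in results.items():
    print(p, "->", formats)
```

With an unanchored search the picture is different: the first pattern would still find the `2` inside `.2`, just not the leading dot.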

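Note: patterns 2 and 3 are slower to succeed than the first when the number is an integer. In a backtracking engine, \d* first consumes all the digits, the engine then fails to find a dot, gives the digits back one at a time, and finally matches the same digits a second time (with the second alternative in pattern 2, or with \d+ after skipping the optional group in pattern 3). See the trace below.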

str  |  pattern       |  state
-----+----------------+-----------------------------
123  |  \d*\.\d+|\d+  |  START
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  go to the next alternative
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK => SUCCESS
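
One rough way to observe this cost is to time both patterns on an integer. The sketch below assumes Python's `re` module, which is a backtracking engine. The 50-digit input and the iteration count are arbitrary choices, and the measured difference depends on the engine and its internal optimizations, so it may be small.

```python
import re
import timeit

integer_text = "1" * 50  # a long run of digits with no dot (arbitrary test input)

first = re.compile(r"\d+(?:\.\d*)?")
second = re.compile(r"\d*\.\d+|\d+")

# Both patterns end up matching the same text.
same_result = first.match(integer_text).group() == second.match(integer_text).group()

# The second pattern only gets there after its first alternative has failed.
t_first = timeit.timeit(lambda: first.match(integer_text), number=20_000)
t_second = timeit.timeit(lambda: second.match(integer_text), number=20_000)
print(f"pattern 1: {t_first:.4f}s  pattern 2: {t_second:.4f}s")
```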

If you want to match all four formats, you can use:

\.\d+|\d+(?:\.\d*)?

(+) If the number doesn't begin with a dot, the first alternative fails immediately and the second alternative matches all the other cases. Backtracking is kept to a minimum.
(-) If only a few numbers start with a dot, the first alternative is still tested, and fails, for every other number. That failure is cheap for the same reason as above (it is decided on the first character), but in this case it is better to put the common case first: \d+(?:\.\d*)?|\.\d+

To support negative values as well, add an optional minus sign, -?:

-?(?:\.\d+|\d+(?:\.\d*)?)
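
As an illustration of how this final pattern might be used, here is a minimal Python sketch that pulls every number out of a WKT string. `NUMBER`, `extract_numbers` and the sample string are hypothetical; the sample was made up to cover all four formats and is not meant to be a valid closed polygon. The pattern only extracts numbers and does not validate the WKT structure around them.

```python
import re

# The signed pattern from above, covering 22, 22.2, .2 and 2.
NUMBER = re.compile(r"-?(?:\.\d+|\d+(?:\.\d*)?)")

def extract_numbers(wkt):
    """Return every number found in the string, in order, as floats."""
    return [float(token) for token in NUMBER.findall(wkt)]

# Hypothetical sample input that exercises all four number formats.
wkt = "POLYGON((51.124 -3.973, 51.1 -3.012, .5 2.))"
numbers = extract_numbers(wkt)
print(numbers)
```

Because the pattern contains only a non-capturing group, `findall` returns the full matched text of each number rather than a group.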

answered Sep 16 '22 22:09

Casimir et Hippolyte
