Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex for well-known text

Tags:

regex

wkt

I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:

POLYGON((51.124 -3.973, 51.1 -3.012, ....))

or

MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))

among other variations.

There is a good answer here: Parsing a WKT-file which uses the regex:

\d+(?:\.\d*)?

From other places I have also seen

\d*\.\d+|\d+

and

(\d*\.)?\d+

These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.

To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.

like image 424
John Powell Avatar asked Mar 12 '14 15:03

John Powell


People also ask

What is this regex ([?

Short for regular expression, a regex is a string of text that lets you create patterns that help match, locate, and manage text. Perl is a great example of a programming language that utilizes regular expressions. However, its only one of the many places you can find regular expressions.

What does ?= Mean in regular expression?

?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the captured match must be followed by whatever is within the parentheses but that part isn't captured. Your example means the match needs to be followed by zero or more characters and then a digit (but again that part isn't captured).

What does * do in regex?

The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.

What is the regex for special characters?

Special Regex Characters: These characters have special meaning in regex (to be discussed below): . , + , * , ? , ^ , $ , ( , ) , [ , ] , { , } , | , \ . Escape Sequences (\char): To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ).


1 Answers

It depends what number formats you need to allow, example:

 format 1:   22
 format 2:   22.2
 format 3:   .2
 format 4:   2.
  • the 1st pattern \d+(?:\.\d*)? matches 1,2,4
  • the 2nd pattern \d*\.\d+|\d+ matches 1,2,3
  • the 3rd pattern (\d*\.)?\d+ matches 1,2,3 (and have an uneeded capturing group)

Note: pattern 2 and 3 are slower to succeed than the first if the number is an integer, because they must match all digits until the dot, backtrack to the start and retry the same digits one more time. (see the schema below)

str  |  pattern       |  state
-----+----------------+-----------------------------
123  |  \d*\.\d+|\d+  |  START
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  go to the next alternative
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK => SUCCESS

if you want to match the four cases, you can use:

\.\d+|\d+(?:\.\d*)?

(+) if the number doesn't begin with a dot, the first alternative fails immediatly and the second alternative will match all other cases. The backtracking is limited to the minimum.
(-) if you have few numbers that start with a dot the first alternative will be tested and will fail each times. However, the first alternative fails quickly.(in other words, for the same reason). In this case, it is better to use \d+(?:\.\d*)?|\.\d+

Obviously, if you want to support negative values you need to add -?:

-?(?:\.\d+|\d+(?:\.\d*)?)
like image 138
Casimir et Hippolyte Avatar answered Sep 16 '22 22:09

Casimir et Hippolyte