When constructing a regular expression to match a list of candidate strings, how can I ensure that all of the strings can be matched? For example,
the regular expression (?:O|SO|S|OH|OS)(?:\s?[-+*°.1-9]){0,4} matches all of the examples below:
O 4 2 -
O 2 -
SO 4 * - 2
S 2-
However, if I swap S and SO, the resulting regular expression (?:O|S|SO|OH|OS)(?:\s?[-+*°.1-9]){0,4} fails to match SO 4 * - 2 as a whole. Instead, the string is split into two matches: S and O 4 * - 2.
So my question is how to order the candidate strings in the regular expression so that each of them is matched safely and unambiguously. Since the actual list of candidate strings in my project is a bit more complicated than this example, is there a sorting algorithm that can achieve this?
You could repeat the character class 1 or more times to prevent matching only single uppercase characters from the alternation and reorder the alternatives:
\b(?:SO|OS|O[HS]|[SO])(?:\s?[-+*°.1-9]){1,4}
The pattern matches:
\b A word boundary to prevent a partial word match
(?: Non-capturing group for the alternatives
SO|OS|O[HS]|[SO] Match either SO, OS, OH, OS, S or O
) Close the non-capturing group
(?:\s?[-+*°.1-9]){1,4} Repeat 1-4 times an optional whitespace char and 1 of the listed characters
See a regex101 demo.
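As a quick sanity check, here is a minimal Python sketch that runs the pattern above against the four examples from the question. The variable names are only illustrative, and Python's re module is assumed as the engine; other flavours should behave the same way for this pattern.

```python
import re

# Pattern from the answer above: longer alternatives come first,
# and at least one suffix character is required.
pattern = re.compile(r"\b(?:SO|OS|O[HS]|[SO])(?:\s?[-+*°.1-9]){1,4}")

examples = ["O 4 2 -", "O 2 -", "SO 4 * - 2", "S 2-"]

# For each example, check that the first match covers the whole string.
whole_string_matched = []
for s in examples:
    m = pattern.search(s)
    whole_string_matched.append(m is not None and m.group(0) == s)

print(whole_string_matched)
```

Note that search is used rather than fullmatch: fullmatch would force the engine to backtrack until the whole string fits, which would hide the ordering problem the question is about.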
The regular expression engine tries to match the alternatives in the order in which they are specified.
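So when the alternation is (?:S|SO), it matches S immediately and moves on to the rest of the pattern. The next part of the input string is O 4 * - 2, and O is not in the character class, so the optional group matches zero times and the match ends after S. The following match then starts at O.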
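Regarding the sorting question: a common approach is to sort the candidates by length in descending order, so that no candidate is shadowed by one of its own prefixes. The sketch below assumes Python's re module; build_pattern and SUFFIX are hypothetical names chosen for this illustration, and the candidates are escaped in case the real list contains regex metacharacters.

```python
import re

# Suffix from the question: up to four symbols, each optionally preceded by whitespace.
SUFFIX = r"(?:\s?[-+*°.1-9]){0,4}"

def build_pattern(candidates):
    # Longest first, ties broken alphabetically to keep the result deterministic.
    ordered = sorted(set(candidates), key=lambda c: (-len(c), c))
    alternation = "|".join(re.escape(c) for c in ordered)
    return re.compile("(?:" + alternation + ")" + SUFFIX)

text = "SO 4 * - 2"

# Order from the question in which S comes before SO.
naive = re.compile(r"(?:O|S|SO|OH|OS)" + SUFFIX)
naive_matches = [m.group(0) for m in naive.finditer(text)]

# Same candidates, sorted longest first.
sorted_pattern = build_pattern(["O", "S", "SO", "OH", "OS"])
sorted_matches = [m.group(0) for m in sorted_pattern.finditer(text)]

print(naive_matches)   # ['S', 'O 4 * - 2']
print(sorted_matches)  # ['SO 4 * - 2']
```

Sorting by length works here because the engine takes the first alternative that succeeds, and the suffix group can always match zero times, so a shorter alternative placed earlier is never reconsidered.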
So I think the trick for matching all of the given strings is to match a leading O or S followed by any number of O, H or S characters:
(?:O|S)(?:O|H|S)*(?:\s?[-+*°.1-9]){0,4}
Demo: https://regex101.com/r/3AwQP7/1
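A small Python sketch of this pattern against the examples from the question follows. One trade-off worth knowing: because the prefix is built from repeated characters rather than a fixed list, it also accepts combinations that were not among the original candidates, such as SH. The names used are only illustrative.

```python
import re

# Pattern from the answer above: O or S, then any run of O, H or S.
loose = re.compile(r"(?:O|S)(?:O|H|S)*(?:\s?[-+*°.1-9]){0,4}")

examples = ["O 4 2 -", "O 2 -", "SO 4 * - 2", "S 2-"]

# Every example from the question is matched as a whole.
loose_results = [loose.search(s).group(0) == s for s in examples]

# A prefix that is not in the original candidate list is accepted too.
extra_match = loose.search("SH 2").group(0)

print(loose_results)
print(extra_match)  # 'SH 2'
```

If only the listed candidates should match, an explicit alternation ordered longest first is the safer choice.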