I'm curious about the algorithm for deciding which characters to include in a regex when using a -
...
Example: [a-zA-Z0-9]
This matches any letter a through z, in either case, and the digits 0 through 9.
I had originally thought that they were used sort of like macros, for example, a-z translates to a,b,c,d,e etc., but after I saw the following in an open source project,
text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
my understanding of regexes changed entirely, because these are not your typical characters. So how the heck did this work correctly, I thought to myself.
My theory is that the - literally means:
Any ASCII value between the left character and the right character (e.g. a-z [97-122]).
Could anybody confirm whether my theory is correct? Does the regex pattern in fact work from the character codes between any two characters?
Furthermore, if it IS correct, could you perform a regex match like A-z? Because A is 65 and z is 122, it should theoretically also match all characters between those values.
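A quick way to check the A-z idea is to walk the code points between the two endpoints and see what else falls inside the range. The sketch below is in Ruby, since the tr snippet above is Ruby; `non_letters_in_range` is a hypothetical helper name made up for this illustration, not part of any library.

```ruby
# Hypothetical helper: list every character between `first` and `last`
# (by code point) that is NOT a plain ASCII letter.
def non_letters_in_range(first, last)
  (first.ord..last.ord)
    .map { |code| code.chr(Encoding::UTF_8) } # code point -> one-character string
    .grep_v(/[A-Za-z]/)                       # drop the ordinary letters
end

# Code points 91..96 sit between Z (90) and a (97).
p non_letters_in_range('A', 'z') # => ["[", "\\", "]", "^", "_", "`"]

# So [A-z] accepts an underscore, while [A-Za-z] does not.
p '_'.match?(/[A-z]/)    # => true
p '_'.match?(/[A-Za-z]/) # => false
```

So a range like A-z does work, but it also picks up the six punctuation characters between Z and a, which is usually not what you want.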
From MSDN - Character Classes in Regular Expressions (bold is mine):
The syntax for specifying a range of characters is as follows:
[firstCharacter-lastCharacter]
where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.
So your assumption is correct, but the effect is, in fact, wider: Unicode character codes, not just ASCII.
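To see this with the snippet from the question, here is a minimal Ruby sketch. Note that `tr` is Ruby's `String#tr`, which is not a regex but uses the same first-last range notation. The helper names `range_bounds` and `circled` are made up for this illustration, and the sketch assumes the source file and strings are UTF-8 (the default in current Ruby versions).

```ruby
# Hypothetical helper: the code points at both ends of a range.
def range_bounds(first, last)
  [first.ord, last.ord]
end

# Hypothetical wrapper around the tr call quoted in the question.
def circled(text)
  text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
end

p range_bounds('a', 'z') # => [97, 122]
p range_bounds('Ⓐ', 'Ⓩ') # => [9398, 9423], i.e. U+24B6..U+24CF
p range_bounds('①', '⑨') # => [9312, 9320], i.e. U+2460..U+2468

puts circled('Hi5')

# The same ranges work inside a regex character class.
p 'Ⓗ'.match?(/[Ⓐ-Ⓩ]/) # => true
```

This also suggests why the original writes the digits as 1-90 rather than 0-9: the circled zero ⓪ is U+24EA, which is not adjacent to ① through ⑨, so it cannot be part of one contiguous range and has to be listed separately at the end.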