
How do dashes work in regex?

Tags: regex

I'm curious about the algorithm that decides which characters are included in a regex when a - is used.

Example: [a-zA-Z0-9]

This matches any letter a through z in either case, and the digits 0 through 9.

I had originally thought that ranges were used sort of like macros, so that a-z expands to a, b, c, d, e, and so on. But then I saw the following in an open source project:

text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')

My understanding of regexes has changed entirely, because these are not your typical characters. How the heck did this work correctly, I thought to myself.

My theory is that the - literally means

Any ASCII value between the left character, and the right character. (e.g. a-z [97-122])

Could anybody confirm whether my theory is correct? Does the regex pattern in fact calculate the range from the character codes, between any two characters?

Furthermore, if it IS correct, could you perform a regex match like

A-z

A is 65 and z is 122, so theoretically it should also match all characters between those values.

ddavison asked Feb 17 '23 04:02



1 Answer

From MSDN - Character Classes in Regular Expressions (bold is mine):

The syntax for specifying a range of characters is as follows:

[firstCharacter-lastCharacter]

where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.

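So your assumption is correct, but the range is in fact wider than you guessed: it is defined over Unicode code points, not just ASCII values. That also means A-z is a valid range, although besides the letters it matches the six characters [ \ ] ^ _ ` that sit between Z (90) and a (97).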
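To check this against the snippet in the question, here is a minimal Ruby sketch. Note that the quoted call is Ruby's String#tr rather than a regex, and the MSDN passage describes .NET. As far as I know, Ruby's regex character classes and tr both compare range endpoints by code point in the same way. The helper names circle and extras_in_upper_a_to_lower_z are made up for illustration. It also suggests why the original call is written 1-90 rather than 0-9: ⓪ (U+24EA) is not adjacent to ① (U+2460), so the zero has to be mapped separately.

```ruby
# Characters with code points 91..96 sit between 'Z' (90) and 'a' (97).
# Returns those that the character class [A-z] accepts.
# (Hypothetical helper name, for illustration only.)
def extras_in_upper_a_to_lower_z
  (91..96).map(&:chr).select { |ch| ch.match?(/[A-z]/) }
end

# Hypothetical helper wrapping the tr call quoted in the question.
# Each range on the left is paired, position by position, with the
# range on the right; both are walked in code point order.
def circle(text)
  text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
end

# Range endpoints are plain code points.
p 'a'.ord, 'z'.ord                     # prints 97, then 122

# [A-z] covers 65..122, so it also accepts [ \ ] ^ _ `
p extras_in_upper_a_to_lower_z

# A regex range works on non-ASCII endpoints too:
# circled M (U+24C2) lies between circled A (U+24B6) and circled Z (U+24CF).
p "\u24C2".match?(/[\u24B6-\u24CF]/)

# Show the code points produced by the tr mapping.
p circle('Abc 190').codepoints.map { |cp| format('U+%04X', cp) }
```

If you only want letters, write [A-Za-z] rather than [A-z], so the punctuation in between is left out.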

acdcjunior answered Mar 03 '23 07:03
