I need some help figuring out the regex for XML character references to control characters, in decimal or hex.
These sequences look like the following:
&#0;
&#31;
&#x0;
&#x1F;
&#x001F;
In other words, each one is an ampersand, followed by a pound sign (#), followed by an optional 'x' to denote hexadecimal mode, followed by 1 to 4 decimal (or hexadecimal) digits, followed by a semicolon.
I'm specifically trying to identify the sequences whose value falls between decimal 0 and 31 (hexadecimal 0 to 1F), inclusive.
Can anyone figure out the regex for this?
If you use a zero-width lookahead assertion to restrict the number of digits, you can write the rest of the pattern without worrying about the length restriction. Try this:
&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);
Explanation:
(?=x?[0-9A-Fa-f]{1,4};) #Restricts the numeric portion to at most four digits, including leading zeroes. The ; inside the lookahead is what enforces the limit: without it, the lookahead only checks that at least one digit follows.
0* #Consumes leading zeroes if there is no x.
[12]?\d #Allows decimal numbers 0 - 29, inclusive.
3[01] #Allows decimal 30 or 31.
x0*1?[0-9A-Fa-f] #Allows hexadecimal 0 - 1F, inclusive, regardless of case or leading zeroes.
This pattern allows leading zeroes after the x, but the (?=x?[0-9A-Fa-f]{1,4};) part prevents them from occurring before an x.
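A quick way to sanity-check a pattern like this is to run it against a handful of references on both sides of the boundary. The sketch below uses Python's re module and anchors the lookahead at the semicolon, (?=x?[0-9A-Fa-f]{1,4};), so that the four-digit limit applies to the whole number. The names CONTROL_REF and is_control_ref are made up for this illustration, and re.ASCII is passed only so that \d means [0-9] in Python.

```python
import re

# The lookahead ends with ';' so that the 1-4 digit limit covers the whole
# numeric portion, not just its first few characters.
# re.ASCII restricts \d to the ASCII digits 0-9.
CONTROL_REF = re.compile(
    r"&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);",
    re.ASCII,
)


def is_control_ref(text: str) -> bool:
    """Return True if text is exactly one reference to a code point in 0-31."""
    return CONTROL_REF.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        "&#0;", "&#31;", "&#0031;", "&#x1F;", "&#x001f;",  # in range
        "&#32;", "&#x20;",                                  # out of range
        "&#00031;",                                         # five digits
        "&#0x1F;",                                          # zero before the x
    ]
    for sample in samples:
        print(sample, is_control_ref(sample))
```

To find such references inside a larger string rather than test a single one, CONTROL_REF.finditer(text) or CONTROL_REF.sub("", text) can be used with the same compiled pattern.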