
Using a Regular Expression to Find XML character references for control characters

I need some help figuring out the regex for XML character references to control characters, in decimal or hex.

These sequences look like the following:

&#x3;
&#5;
&#x1F;
&#31;
In other words, each one is an ampersand, followed by a pound sign (#), followed by an optional 'x' to denote hexadecimal mode, followed by 1 to 4 decimal (or hexadecimal) digits, followed by a semicolon.

I'm specifically trying to identify the sequences whose numeric value falls between decimal 0 and 31 (hexadecimal 0 to 1F), inclusive.

Can anyone figure out the regex for this?

Ken Mason asked Feb 23 '23 23:02


1 Answer

If you use a zero-width lookahead assertion to restrict the number of digits, you can write the rest of the pattern without worrying about the length restriction. Try this:

&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);

Explanation:

(?=x?[0-9A-Fa-f]{1,4};) #Restricts the numeric portion to at most four digits, including leading zeroes. The ; inside the lookahead is what enforces the limit; without it, the assertion only checks that at least one digit follows.
0*                      #Consumes leading zeroes if there is no x.
[12]?\d                 #Allows decimal numbers 0 - 29, inclusive.
3[01]                   #Allows decimal 30 or 31.
x0*1?[0-9A-Fa-f]        #Allows hexadecimal 0 - 1F, inclusive, regardless of case or leading zeroes.

This pattern allows leading zeroes after the x. Because the lookahead requires the digits to run straight to the semicolon, it also rejects zeroes before an x (for example &#0x1F;) and references with more than four digits.
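To check the behaviour against a few inputs, here is a minimal sketch in Python, assuming the standard re module. It writes the lookahead as (?=x?[0-9A-Fa-f]{1,4};), with the semicolon inside the assertion, so the four-digit limit is anchored at the end of the reference. The names CONTROL_REF and find_control_refs are illustrative only and do not come from the question.

```python
import re

# Pattern from the answer, with the lookahead anchored at the semicolon.
# [0-9] replaces \d so only ASCII digits match, and the group is
# non-capturing so findall() returns whole references.
CONTROL_REF = re.compile(
    r"&#(?=x?[0-9A-Fa-f]{1,4};)0*(?:[12]?[0-9]|3[01]|x0*1?[0-9A-Fa-f]);"
)


def find_control_refs(text):
    """Return every character reference in text whose value is 0 to 31."""
    return CONTROL_REF.findall(text)


if __name__ == "__main__":
    sample = "a&#x1F;b&#31;c&#32;d&#x20;e&#0009;f&#00009;g&#0x1F;h"
    # &#32; and &#x20; are out of range, &#00009; has five digits,
    # and &#0x1F; has a zero before the x, so none of those match.
    print(find_control_refs(sample))  # ['&#x1F;', '&#31;', '&#0009;']
```

Whether [0-9] or \d is the better choice depends on the regex flavour: in Python 3 and .NET, \d also matches non-ASCII digits unless an ASCII-only option is set.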

Justin Morgan answered May 09 '23 20:05