
What literal characters should be escaped in a regex?

I just wrote a regex for use with the PHP function preg_match that contains the following part:

[\w-.] 

This is meant to match any word character, as well as a minus sign and a dot. While it seems to work in preg_match, when I put it into a utility called Reggy, it complains about "Empty range in char class". Trial and error taught me that escaping the minus sign solves the issue, turning the regex into

[\w\-.] 

Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and - since the dot is also a character with a meaning in PHP - why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?

Pelle asked Mar 30 '11 08:03





2 Answers

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped when you want to match them literally. There are some corner cases though:

  • - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which is equivalent to [\^a]
  • ] needs no escaping if it's the first character in the class, at least in many implementations: []] matches the char ]
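
To illustrate the rules above, here is a minimal sketch using Python's re module. Python is an assumption on my part, since the question is about PHP's preg_match, but it happens to be one of the stricter dialects: it accepts a hyphen at the start or end of a class, yet rejects one placed between a shorthand class and another character. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A hyphen at the very start or end of a class is a literal hyphen.
assert re.fullmatch(r"[abc-]+", "a-b-c")
assert re.fullmatch(r"[-abc]+", "-cab")

# An escaped hyphen is a literal hyphen at any position.
assert re.fullmatch(r"[\w\-.]+", "file-name.txt")

# A caret only negates when it is the first character of the class.
assert re.fullmatch(r"[a^]+", "a^^a")
assert re.fullmatch(r"[^a]", "a") is None

# A closing bracket placed first in the class is taken literally.
assert re.fullmatch(r"[]]", "]")

# Dialect difference: Python's re refuses a hyphen that sits between
# a shorthand class and another character, as in the question.
assert not compiles(r"[\w-.]")
assert compiles(r"[\w\-.]")
assert compiles(r"[\w.-]")
```

Escaping the hyphen, or moving it to the end of the class, gives a pattern that does not depend on how lenient a particular implementation is.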
Bart Kiers answered Oct 09 '22 04:10



[\w.-] 
  • The . usually means any character, but between [] it has no special meaning.
  • A - between [] indicates a range unless it is escaped or is the first or last character between [].
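
A short sketch of both points, again in Python's re module rather than PHP (an assumption made for illustration; NAME_CHARS and is_name are hypothetical names):

```python
import re

# Hypothetical name for the character class suggested in this answer.
NAME_CHARS = re.compile(r"[\w.-]+")


def is_name(text):
    """Return True if text contains only word characters, dots and hyphens."""
    return NAME_CHARS.fullmatch(text) is not None


# Inside the class the dot is a plain dot and the trailing hyphen is a
# plain hyphen, so "!" is not accepted.
assert is_name("my-file.v2")
assert not is_name("a!b")

# Outside a class, an unescaped dot matches any character except a newline,
# while an escaped dot matches only a literal dot.
assert re.fullmatch(r"a.b", "a!b")
assert re.fullmatch(r"a\.b", "a!b") is None
```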
bw_üezi answered Oct 09 '22 02:10
