
Why is only ) a special character and not } or ]?

Tags:

regex

I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.

In the second chapter, Jan has a section on "special characters:"

Special Characters

Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.

(emphasis mine)

I understand why only the opening square bracket and opening curly brace are special: a closing brace or bracket is clearly a literal if there's no preceding opening one. However, why does Jan specify that the closing parenthesis is a special character if the other two closing characters aren't?

scohe001 asked Sep 20 '18 19:09



1 Answer

Short answer

The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.

Full answer

First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.

What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
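As a rough illustration, here is how one flavor treats a lone ). Python's re module is used only as an assumed stand-in; it is not necessarily one of the flavors the book covers, and the helper name compiles is made up for this sketch.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A closing parenthesis with no matching "(" is rejected as a syntax error.
lone_close_ok = compiles(")")

# Escaped, it is just a literal character.
escaped_close_ok = compiles(r"\)")
matched = re.fullmatch(r"f\(x\)", "f(x)") is not None
```

In this flavor the pattern ")" fails to compile, so the character cannot be used alone without an escape.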

} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
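A small sketch of the same point, again assuming Python's re module as the flavor (other flavors may differ in detail): escaping only the opening { is enough to match {7} or {7,42} literally.

```python
import re

# "}" on its own is an ordinary literal.
brace_alone = re.fullmatch(r"}", "}") is not None

# Unescaped, {7} is a quantifier: "x{7}" means seven "x" characters.
quantified = re.fullmatch(r"x{7}", "xxxxxxx") is not None
not_literal = re.fullmatch(r"x{7}", "x{7}") is None

# Escaping only the opening brace makes the whole sequence literal.
literal_count = re.fullmatch(r"x\{7}", "x{7}") is not None
literal_range = re.fullmatch(r"x\{7,42}", "x{7,42}") is not None
```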

] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
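For example, outside a character class an unescaped ] simply matches itself. This sketch assumes Python's re module; the subject string is invented for illustration.

```python
import re

# Outside a character class, "]" needs no escape.
unescaped = re.fullmatch(r"a]", "a]") is not None

# Only the opening "[" has to be escaped to match "[0]" literally.
index_like = re.search(r"\[0]", "items[0]")
found = index_like.group() if index_like else None
```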

Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
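The three ways of writing a literal ] inside a character class can be compared side by side. This is a sketch under the assumption that Python's re module behaves like the non-JavaScript flavors described above; the sample text is made up.

```python
import re

text = "a]b-c"

# "]" placed first in the class is taken as a literal member.
first = re.findall(r"[]a]", text)

# "]" directly after the negating caret is also literal.
negated = re.findall(r"[^]a]", text)

# The escaped form, which the answer says JavaScript requires.
escaped = re.findall(r"[a\]]", text)
```

The first and third patterns select the same characters, so the choice between them is a matter of style wherever both are accepted.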

I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
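To separate the two layers of escaping, the sketch below (assuming Python and its re module) shows a backslash that belongs to the string literal and backslashes that belong to the regex. The subject string is invented.

```python
import re

# Here the backslash is consumed by Python's string syntax,
# so the regex engine only ever sees the four characters it's.
source_escaped = 'it\'s'
regex_length = len(source_escaped)

subject = 'say "hi" or \'bye\' at a/b'

# Quotes and slashes are plain literals to the regex engine.
plain = re.findall(r"""["'/]""", subject)

# Escaping them inside the regex is accepted but changes nothing.
over_escaped = re.findall(r"""[\"\'\/]""", subject)
```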

I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).

But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
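A brief check of that equivalence, assuming Python's re module, which also accepts a backslash before punctuation that is already literal:

```python
import re

# Unescaped and escaped forms match the same text.
bare = re.fullmatch(r"}]", "}]") is not None
escaped = re.fullmatch(r"\}\]", "}]") is not None
```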

So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
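The difference between "special on its own" and "special in combination" can be sketched as follows, assuming Python's re module. The mode-modifier use of - is left out, and all subject strings are invented.

```python
import re

# On their own, these characters are plain literals.
literals = re.fullmatch(r"a,b:c!d<e-f", "a,b:c!d<e-f") is not None

# Combined with other syntax, each one takes on a role.
quantifier = re.fullmatch(r"a{2,3}", "aaa") is not None        # "," in a quantifier
non_capturing = re.fullmatch(r"(?:ab)+", "abab").groups()      # ":" in (?:...)
negative_ahead = re.findall(r"a(?!b)", "ab ac")                # "!" in (?!...)
lookbehind = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")      # "<" in (?<=...)
char_range = re.findall(r"[a-c]", "a-d-c")                     # "-" in a class range
```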

But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.

Jan Goyvaerts answered Sep 28 '22 18:09