
Why is an underscore (_) not regarded as a non-word character?

The regex \W matches every non-word character, but it does not match the underscore. Why is that?

Oghenebrume asked Mar 28 '18 11:03




2 Answers

According to Jeffrey Friedl's book Mastering Regular Expressions, this began as a change in Perl's regular expressions. It dates back to 1988, when \w was changed to match the characters allowed in a Perl variable name [Page 89]:

Perl 2 was released in June 1988. Larry had replaced the regex code entirely, this time using a greatly enhanced version of the Henry Spencer package mentioned in the previous section. You could still have at most nine sets of parentheses, but now you could use | inside them. Support for \d and \s was added, and support for \w was changed to include an underscore, since then it would match what characters were allowed in a Perl variable name.

revo answered Sep 28 '22 10:09


\W is defined as [^A-Za-z0-9_].

It is the opposite of \w which is [A-Za-z0-9_] and means "a word character".
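The equivalence can be checked with a minimal sketch in Python (an assumption, since the question names no language). Python's `re` module makes `\w` and `\W` Unicode-aware for `str` patterns by default, so the `re.ASCII` flag is passed here to get the ASCII definition given above. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing an underscore, a hyphen, a space and "!".
sample = "foo_bar-baz 42!"

# With re.ASCII, \W should select the same characters as the explicit class.
non_word_shorthand = re.findall(r"\W", sample, flags=re.ASCII)
non_word_explicit = re.findall(r"[^A-Za-z0-9_]", sample)

print(non_word_shorthand)  # ['-', ' ', '!']
print(non_word_explicit)   # ['-', ' ', '!']
```

The underscore appears in neither list, because it belongs to the word-character class.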

It is not about words as you perceive them in a spoken language. "Word" here means an identifier. Most programming languages allow uppercase and lowercase letters, digits and underscores (_) in identifiers.
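To illustrate the identifier reading, here is a small sketch, again in Python with `re.ASCII` (both are assumptions, and the source line being scanned is invented). `\w+` keeps underscore-joined names whole, while a hyphen splits them:

```python
import re

# Hypothetical line of source code; "tax-rate" is deliberately not a valid identifier.
source = "total_count = item_price * 2 + tax-rate"

# \w+ matches maximal runs of letters, digits and underscores.
tokens = re.findall(r"\w+", source, flags=re.ASCII)

print(tokens)  # ['total_count', 'item_price', '2', 'tax', 'rate']
```

Note that `\w+` also matches the bare number `2`, so on its own it is looser than a real identifier rule, which usually forbids a leading digit.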

axiac answered Sep 28 '22 09:09