
Why does string.replace(/\W*/g,'_') prepend all characters?

I've been learning regexp in JS and encountered a situation that I didn't understand.

I ran a test of the replace function with the following regexp:

/\W*/g

I expected it to prepend an underscore to the beginning of the string and then replace all non-word characters.

The Number is (123)(234)

would become:

_The_Number_is__123___234_

This would prepend to the string because the pattern matches zero or more instances, and then replace all spaces and other non-word characters.

Instead, it prepended an underscore to every character and replaced all non-word characters.

_T_h_e__N_u_m_b_e_r__i_s__1_2_3__2_3_4__

Why did it do this?

Judd Franklin asked Mar 03 '17 21:03




2 Answers

The problem is the meaning of \W*. It means "0 or more non-word characters". This means that the empty string "" would match, given that it is indeed 0 non-word characters.

So besides matching each run of non-word characters, the regex also matches the empty string before every word character and at the end of the string, and each of those empty matches is replaced with _ as well.
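To see those empty matches directly, here is a minimal sketch that lists every match of /\W*/g together with its start index. The helper name listMatches and the short sample string are only illustrative, and String.prototype.matchAll assumes an ES2020 runtime.

```javascript
// Hypothetical helper: collect [index, matchedText] pairs for a global regex.
function listMatches(str, re) {
  return Array.from(str.matchAll(re), (m) => [m.index, m[0]]);
}

const sample = "a (b";
const matches = listMatches(sample, /\W*/g);

console.log(matches);
// [[0, ""], [1, " ("], [3, ""], [4, ""]]

// Every match, empty or not, receives one replacement.
console.log(sample.replace(/\W*/g, "_")); // _a__b_
```

Only the match at index 1 consumes any characters; the other three are zero-length matches, which is where the extra underscores come from.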

You want either /\W/g (replacing each individual non-word character) or /\W+/g (replacing each set of consecutive non-word characters).

"The Number is (123)(234)".replace(/\W/g, '_')  // "The_Number_is__123__234_"
"The Number is (123)(234)".replace(/\W+/g, '_') // "The_Number_is_123_234_"
lonesomeday answered Nov 18 '22 22:11


TL;DR

  1. Never use a pattern that can match an empty string in a regex replace method if your aim is to replace text and not to insert it.

  2. To replace each separate occurrence of a non-word char in a string, use .replace(/\W/g, '_') (that is, remove the * quantifier, which matches zero or more occurrences of the quantified subpattern).

  3. To replace each chunk of consecutive non-word chars in a string with a single _, use .replace(/\W+/g, '_') (that is, replace the * quantifier with +, which matches one or more occurrences of the quantified subpattern).

    Note: the solution below is tailored to the OP's much more specific requirements.

A string is parsed by the JS regex engine as a sequence of chars and locations in between them. See the following diagram where I marked locations with hyphens:

  -T-h-e- -N-u-m-b-e-r- -i-s- -(-1-2-3-)-(-2-3-4-)-
  |||                                             |
  ||Location between T and h, etc. .............  |
  |1st symbol                                     |
start                     ->                     end

All these positions can be analyzed and matched with a regex.
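The diagram can be reproduced in code. As a small sketch (the variable names are illustrative), an empty pattern with the g flag matches at every one of those positions, so replacing each match with a hyphen marks them all:

```javascript
const text = "The Number is (123)(234)";

// (?:) is an empty pattern: it matches the empty string at every position,
// i.e. before each character and once more at the end of the string.
const marked = text.replace(/(?:)/g, "-");

console.log(marked);
// -T-h-e- -N-u-m-b-e-r- -i-s- -(-1-2-3-)-(-2-3-4-)-

// A string of n characters has n + 1 positions.
const positionCount = text.match(/(?:)/g).length;
console.log(text.length, positionCount); // 24 25
```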

Since /\W*/g is a regex matching all non-overlapping occurrences (due to the g modifier) of 0 or more (due to the * quantifier) non-word chars, all the positions before word chars are matched too, as empty matches. For example, the location between T and h is tested with the regex, and as there is no non-word char there (h is a word char), an empty match is returned (since \W* can match an empty string).

So, you need to replace the start of the string and each non-word char with a _. A naive approach is to use .replace(/\W|^/g, '_'). However, there is a caveat: if a string starts with a non-word character, no extra _ will be prepended at the start of the string:

console.log("Hi there.".replace(/\W|^/g, '_'));  // _Hi_there_
console.log(" Hi there.".replace(/\W|^/g, '_')); // _Hi_there_

Note that here, \W comes first in the alternation and "wins" when matching at the beginning of the string: the space is matched and then no start position is found at the next match iteration.

You may now think you can match with /^|\W/g. Look here:

console.log("Hi there.".replace(/^|\W/g, '_'));  // _Hi_there_
console.log(" Hi there.".replace(/^|\W/g, '_')); // _ Hi_there_

The second result, _ Hi_there_, shows how the JS regex engine handles zero-width matches during a replace operation: once a zero-width match is found (here, the position at the start of the string), the replacement occurs and the regex's lastIndex property is incremented, so matching resumes at the position after the first character. That is why the first space is preserved and is never matched with \W.
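The behaviour can be modelled with an explicit exec loop. This is a simplified sketch, not the engine's actual code: the function name replaceAllSketch is made up, it only supports a plain string replacement (no $1-style patterns), and it advances by one code unit, whereas the specified algorithm advances by a whole code point when the u flag is set.

```javascript
// Simplified model of a global replace with a plain string replacement.
// `re` must have the g flag so that exec honours lastIndex.
function replaceAllSketch(str, re, replacement) {
  re.lastIndex = 0;
  let out = "";
  let copiedUpTo = 0; // end of the source text already copied to `out`
  let m;
  while ((m = re.exec(str)) !== null) {
    out += str.slice(copiedUpTo, m.index) + replacement;
    copiedUpTo = m.index + m[0].length;
    // After a zero-width match the search must move on, or it would find
    // the same empty match forever. The character stepped over is never
    // offered to the pattern again, but it is still copied to the output.
    if (m[0] === "") re.lastIndex += 1;
  }
  return out + str.slice(copiedUpTo);
}

console.log(replaceAllSketch(" Hi there.", /^|\W/g, "_")); // _ Hi_there_
console.log(replaceAllSketch(" Hi there.", /\W|^/g, "_")); // _Hi_there_
```

With /^|\W/g the first match is the empty one at index 0, so the search resumes at index 1 and the leading space is copied through untouched.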

A solution is to let the start-of-string alternative also consume a leading non-word character when there is one, so that this character cannot be skipped after a zero-width match:

console.log("Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; }));  // _Hi_there_
console.log(" Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; })); // __Hi_there_
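As a cross-check, and assuming the goal really is "one _ at the start plus one _ per non-word char", the same output can also be produced without any zero-width match by doing the prepend separately. The function names below are hypothetical, and this is only a sketch of an alternative, not part of the answer's approach:

```javascript
// The callback-based replace from above, wrapped in a function.
function viaCallback(str) {
  return str.replace(/^(\W?)|\W/g, (match, leading) => (leading ? "__" : "_"));
}

// Assumed-equivalent alternative: prepend first, then replace each
// non-word char, so the regex never needs to match an empty string.
function viaConcat(str) {
  return "_" + str.replace(/\W/g, "_");
}

for (const s of ["Hi there.", " Hi there.", "", "(123)"]) {
  console.log(JSON.stringify(s), viaCallback(s), viaConcat(s));
}
```

Both functions return _Hi_there_ for "Hi there." and __Hi_there_ for " Hi there.".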
Wiktor Stribiżew answered Nov 18 '22 22:11