
Regex to replace 'NO-BREAK SPACE'

I am looking for a regex to replace NO-BREAK SPACE characters in a string.

There are some questions on SO related to NO-BREAK SPACE, but none seems to point me to the right answer.

So far, I have tried the following without success (the second character of the string "A B" is a no-break space):

"A B".replace(new RegExp(String.fromCharCode(160),"g"),"xxx");
"A B".replace($('<b>&nbsp;</b>').text(), 'xxx');
"A B".replace(/\xA0/,'xxx');
"A B".replace(/\\xA0/,'xxx');
"A B".replace(/\u00A0/,'xxx');
"A B".replace(/\\u00A0/,'xxx');

UPDATE: Stupid me. The truth is that I tested with the wrong character for quite some time.
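
Since the update says the test string held the wrong character, one way to rule that out is to inspect the code points directly. The following is a minimal sketch, assuming the test strings are built with explicit escapes so the character is unambiguous; the helper name `describeChars` is hypothetical and not part of the original question.

```javascript
// Hypothetical helper: list each character's code point as U+XXXX so that
// an ordinary space (U+0020) and a no-break space (U+00A0) can be told
// apart even though they look identical on screen.
function describeChars(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Build the test strings with escapes instead of typing the character.
const withNbsp = "A\u00A0B";
const withSpace = "A\u0020B";

console.log(describeChars(withNbsp));  // [ 'U+0041', 'U+00A0', 'U+0042' ]
console.log(describeChars(withSpace)); // [ 'U+0041', 'U+0020', 'U+0042' ]

// With a real no-break space in the string, the single-backslash patterns
// from the question match. The g flag replaces every occurrence.
const replaced = withNbsp.replace(/\u00A0/g, "xxx");
console.log(replaced); // AxxxB

// The double-backslash variants look for a literal backslash followed by
// "xA0" or "u00A0", so they leave this string unchanged.
const unchanged = withNbsp.replace(/\\xA0/, "xxx");
console.log(unchanged === withNbsp); // true
```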

Thariama asked Aug 03 '15 14:08



2 Answers

Besides the ordinary space and NO-BREAK SPACE, other space characters can also appear in strings.

Here is the complete list of Unicode space characters according to the source: http://jkorpela.fi/chars/spaces.html

Number Character name
\u0020 space
\u00A0 no-break space
\u1680 Ogham space mark
\u180E Mongolian vowel separator
\u2000 en quad
\u2001 em quad
\u2002 en space (nut)
\u2003 em space (mutton)
\u2004 three-per-em space (thick space)
\u2005 four-per-em space (mid space)
\u2006 six-per-em space
\u2007 figure space
\u2008 punctuation space
\u2009 thin space
\u200A hair space
\u200B zero width space
\u202F narrow no-break space
\u205F medium mathematical space
\u3000 ideographic space
\uFEFF zero width no-break space

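Therefore, to replace all of these unusual spaces with a regular space (the g flag is needed so that every occurrence is replaced, not just the first):

.replace(/[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]/g, " ")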

From the above, you may exclude \u1680, since it's "usually not really a space but a dash".
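
The replacement above could be wrapped in a small function. This is only a sketch, assuming every occurrence should be replaced and that \u1680 may optionally be left alone; the function name `normalizeSpaces` and its `keepOgham` option are hypothetical.

```javascript
// Character classes taken from the list above. U+0020 is omitted because
// it is the replacement character itself.
const ODD_SPACES = /[\u00A0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]/g;
// Same class without U+1680 (Ogham space mark).
const ODD_SPACES_NO_OGHAM = /[\u00A0\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]/g;

// Hypothetical helper: replace each listed space character with a plain space.
// When keepOgham is true, U+1680 is left untouched.
function normalizeSpaces(str, { keepOgham = false } = {}) {
  return str.replace(keepOgham ? ODD_SPACES_NO_OGHAM : ODD_SPACES, " ");
}

console.log(normalizeSpaces("A\u00A0B\u2003C\u3000D") === "A B C D");         // true
console.log(normalizeSpaces("x\u1680y", { keepOgham: true }) === "x\u1680y"); // true
```

Note that \u200B and \uFEFF have no width, so replacing them with a space adds visible whitespace; whether that is wanted depends on the input.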

Rakesh Chaudhari answered Sep 22 '22 06:09


Apparently there is no Unicode category that covers this use case.
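
As a rough check of that claim, the sketch below (in JavaScript, to match the question) tests each listed character against the Unicode Space_Separator category using \p{Zs}, which requires the u flag. It assumes a runtime that supports Unicode property escapes and ships Unicode data of version 6.3 or later, where U+180E is classified as a format character rather than a space.

```javascript
// Code points from the list in the answer above.
const LISTED = [
  0x0020, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
  0x2006, 0x2007, 0x2008, 0x2009, 0x200a, 0x200b,
  0x202f, 0x205f, 0x3000, 0xfeff,
];

// Matches a string consisting of exactly one Space_Separator character.
const spaceSeparator = /^\p{Zs}$/u;

// Collect the listed characters that \p{Zs} does not match.
const notInZs = LISTED
  .filter((cp) => !spaceSeparator.test(String.fromCodePoint(cp)))
  .map((cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0"));

console.log(notInZs); // [ 'U+180E', 'U+200B', 'U+FEFF' ]
```

On such a runtime the Mongolian vowel separator and the two zero-width characters are left over, which is consistent with no single category covering the whole list.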

The regex in @Rakesh's answer was missing some characters from the list of Unicode spaces, and I needed the C# flavor.

Here the list is converted to a C# expression that produces the regex pattern:

// Requires: using System.Linq;
// Wraps the alternatives in a group "(...)" so the pattern is a valid regex alternation.
string.Concat("(", string.Join("|", new[]
{
    new { c = '\u0020', desc = "space" },
    new { c = '\u00A0', desc = "no-break space" },
    new { c = '\u1680', desc = "Ogham space mark" },
    new { c = '\u180E', desc = "Mongolian vowel separator" },
    new { c = '\u2000', desc = "en quad" },
    new { c = '\u2001', desc = "em quad" },
    new { c = '\u2002', desc = "en space (nut)" },
    new { c = '\u2003', desc = "em space (mutton)" },
    new { c = '\u2004', desc = "three-per-em space (thick space)" },
    new { c = '\u2005', desc = "four-per-em space (mid space)" },
    new { c = '\u2006', desc = "six-per-em space" },
    new { c = '\u2007', desc = "figure space" },
    new { c = '\u2008', desc = "punctuation space" },
    new { c = '\u2009', desc = "thin space" },
    new { c = '\u200A', desc = "hair space" },
    new { c = '\u200B', desc = "zero width space" },
    new { c = '\u202F', desc = "narrow no-break space" },
    new { c = '\u205F', desc = "medium mathematical space" },
    new { c = '\u3000', desc = "ideographic space" },
    new { c = '\uFEFF', desc = "zero width no-break space" },
}
.Select(a => $"\\u{(int)a.c:X4}") // format each character as a \uXXXX escape
), ")")

// Becomes "(\u0020|\u00A0|\u1680|\u180E|\u2000|\u2001|\u2002|\u2003|\u2004|\u2005|\u2006|\u2007|\u2008|\u2009|\u200A|\u200B|\u202F|\u205F|\u3000|\uFEFF)"

For copy-paste and view in LINQPad:
.Select(a => new { a.c, num = (int)a.c, part = $"\\u{(int)a.c:X4}", a.desc })

Grastveit answered Sep 22 '22 06:09