
Regex to Match Horizontal White Spaces

I need a regex in Python 2 that matches only horizontal whitespace, not newlines.

\s matches all whitespace, including newlines.

>>> re.sub(r"\s", "", "line 1.\nline 2\n")
'line1.line2'

\h does not work at all.

>>> re.sub(r"\h", "", "line 1.\nline 2\n")
'line 1.\nline 2\n'

[\t ] works, but I am not sure whether I am missing other whitespace characters, especially in Unicode, such as \u00A0 (non-breaking space) or \u200A (hair space). There are many more whitespace characters listed at the following link: https://www.cs.tut.fi/~jkorpela/chars/spaces.html (dead link)

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

Do you have any suggestions?

Memduh asked Sep 07 '17 12:09



2 Answers

I ended up using [^\S\n] instead of specifying all Unicode white spaces.

>>> re.sub(r"[^\S\n]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\n'

>>> re.sub(r"[\t ]", "", u"line 1.\nline 2\n\u00A0\u200A\n", flags=re.UNICODE)
u'line1.\nline2\n\xa0\u200a\n'

It works as expected.
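For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the names HORIZONTAL_WS and strip_horizontal_ws are made up for illustration, and excluding \r, \f, \v, NEL (\x85), LS (\u2028) and PS (\u2029) in addition to \n is an assumption about what should count as "vertical"; the pattern above excludes only \n.

```python
# -*- coding: utf-8 -*-
import re

# "[^\S...]" reads as "not non-whitespace", i.e. whitespace, minus whatever
# else is listed in the class. The extra exclusions after \n are an
# assumption, not part of the original pattern.
# The literal is deliberately non-raw: Python expands \n, \x85, \u2028 etc.
# itself, and only \S reaches the regex engine as an escape.
HORIZONTAL_WS = re.compile(u"[^\\S\n\r\f\v\x85\u2028\u2029]", re.UNICODE)


def strip_horizontal_ws(text):
    """Remove every horizontal whitespace character from a unicode string."""
    return HORIZONTAL_WS.sub(u"", text)
```

Called on the sample string from the question, strip_horizontal_ws(u"line 1.\nline 2\n\u00A0\u200A\n") should give the same u'line1.\nline2\n\n' as above, while a \r\n line ending is left untouched.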

Memduh answered Oct 13 '22 09:10


If you only want to match actual spaces, try a plain ( )+ (the parentheses are for readability only*). If you want to match spaces and tabs, try [ \t]+ (the + is there so that a run of, say, 3 space characters is matched as a whole).

Now there are in fact other whitespace characters in unicode, that's true. You are, however, highly unlikely to encounter any of those in written code, and also pretty unlikely to encounter any of the less common whitespace chars in other texts.
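To see which characters are meant, one option is to ask Python's unicodedata module for everything in the Unicode category Zs ("space separator"). This is a rough sketch: the helper name space_separators is invented here, the scan is limited to the Basic Multilingual Plane, and the exact result depends on the Unicode version shipped with your Python. Note that tab is a control character (category Cc), so it does not show up in this list.

```python
import unicodedata

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


def space_separators():
    """Return the BMP characters whose Unicode category is Zs."""
    return [unichr(cp) for cp in range(0x10000)
            if unicodedata.category(unichr(cp)) == "Zs"]


# Print code point and official name of each space separator.
for ch in space_separators():
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))
```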

If you want to, you can include \u00A0 (non-breaking space, fairly common in scientific papers and on some websites; this is the HTML &nbsp;), en-space \u2002 (&ensp;), em-space \u2003 (&emsp;) or thin space \u2009 (&thinsp;).

You can find a variety of other unicode whitespace characters on Wikipedia, but I highly doubt it's necessary to include them. I'd just stick to space, tab and maybe non-breaking space (i.e. [ \t\u00A0]+).
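As a sketch of the explicit-list approach, here is a character class with the characters mentioned above. The names EXPLICIT_WS and collapse_spaces are made up for this example. One thing to watch for in Python 2: as far as I know, the re module there does not interpret \uXXXX escapes itself, so the pattern is written as a non-raw unicode literal and Python expands the escapes before the regex is compiled.

```python
# -*- coding: utf-8 -*-
import re

# Space, tab, non-breaking space, en space, em space, thin space.
# Non-raw unicode literal: Python (not the regex engine) expands \t and \u.
EXPLICIT_WS = re.compile(u"[ \t\u00A0\u2002\u2003\u2009]+")


def collapse_spaces(text):
    """Replace each run of the listed characters with a single space."""
    return EXPLICIT_WS.sub(u" ", text)
```

Anything not in the list, for example the hair space \u200A from the question, passes through unchanged, which is the trade-off of listing characters by hand.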

What do you intend to match with \h, anyway? It's not a valid escape in Python's re module, as far as I know (Perl, PCRE and Java do define \h as horizontal whitespace).

 

*Stack Overflow doesn't display spaces at the edge of inline code.

PixelMaster answered Oct 13 '22 10:10