
Java regular expression to match _all_ whitespace characters

Tags:

java

regex

I'm looking for a regular expression in Java that matches all whitespace characters in a String. "\s" matches only some; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression that matches all (common) whitespace characters that can occur in a Java String.

[Edit]

To clarify: I do not mean the string sequence "&nbsp;". I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar whitespace meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), the word joiner encoded in Unicode 3.2 and above as U+2060, "ZERO WIDTH NO-BREAK SPACE" (U+FEFF), and any other character that can be regarded as whitespace.

[Answer]

For my purpose, i.e. catching all whitespace characters, Unicode plus traditional, the following expression does the job:

[\p{Z}\s]

The answer is in the comments below but since it is a bit hidden I repeat it here.
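
A minimal sketch of how that expression behaves in Java follows. The class and method names are made up for illustration. One caveat worth checking for your own input: U+2060 and U+FEFF are in Unicode category Cf (format), not Z, so `[\p{Z}\s]` as written does not match them. If you need them, they would have to be added to the character class explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class UnicodeWhitespace {
    // The expression from the question: Unicode separators (Z) plus Java's \s.
    static final Pattern ANY_WHITESPACE = Pattern.compile("[\\p{Z}\\s]");

    // Java's default \s, which covers only [ \t\n\x0B\f\r].
    static final Pattern DEFAULT_WHITESPACE = Pattern.compile("\\s");

    // True if the whole string is exactly one character matched by the pattern.
    static boolean isMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // Collapses every run of matched whitespace into a single ASCII space.
    static String normalize(String input) {
        return input.replaceAll("[\\p{Z}\\s]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isMatch(DEFAULT_WHITESPACE, "\u00A0")); // false
        System.out.println(isMatch(ANY_WHITESPACE, "\u00A0"));     // true  (NO-BREAK SPACE)
        System.out.println(isMatch(ANY_WHITESPACE, "\u202F"));     // true  (NARROW NO-BREAK SPACE)
        System.out.println(isMatch(ANY_WHITESPACE, "\t"));         // true  (via \s)
        System.out.println(isMatch(ANY_WHITESPACE, "\u2060"));     // false (WORD JOINER is Cf)
        System.out.println(isMatch(ANY_WHITESPACE, "\uFEFF"));     // false (U+FEFF is Cf)
        System.out.println(normalize("a\u00A0\u00A0b\t\u202Fc"));  // a b c
    }
}
```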

Carsten asked Nov 30 '09 22:11



1 Answer

&nbsp; is not a whitespace character as far as regexes are concerned. You need to either modify the regex to include those strings in addition to \s, like /(\s|&nbsp;|%20)/, or parse the string contents beforehand to get the ASCII or Unicode representation of the data.
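
As a rough sketch of both options in Java (class and method names are invented for illustration, and only the `&nbsp;` entity and `%20` are handled, which is an assumption about the input): the first pattern matches the encoded text directly, the second method decodes first and then matches real characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class EncodedWhitespace {
    // Option 1: match real whitespace, the literal entity text, or a URL-encoded space.
    static final Pattern ENCODED_OR_REAL = Pattern.compile("(\\s|&nbsp;|%20)");

    // Option 2: decode the textual forms first, then collapse actual whitespace characters.
    static String decodeThenCollapse(String raw) {
        String decoded = raw.replace("&nbsp;", "\u00A0").replace("%20", " ");
        return decoded.replaceAll("[\\p{Z}\\s]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(ENCODED_OR_REAL.matcher("a&nbsp;b%20c d").replaceAll("_")); // a_b_c_d
        System.out.println(decodeThenCollapse("a&nbsp;%20b"));                         // a b
    }
}
```

For real HTML or URL input, a proper entity or URL decoder would be a safer choice than these two hard-coded replacements.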

You are mixing abstraction levels here.

If, as a careful reread of the question suggests, you are after a way to match all whitespace characters, meaning standard ASCII whitespace plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the job.
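
To illustrate the difference between the two classes, here is a small sketch (the class name is made up). \p{Zs} covers only space separators, \p{Z} also covers the line and paragraph separators, and neither covers tab or newline, which are control characters. That is why combining \p{Z} with \s can be useful.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class SeparatorCategories {
    // True if the whole string matches the given regex.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // LINE SEPARATOR (category Zl), built from its code point.
        String lineSeparator = String.valueOf((char) 0x2028);

        System.out.println(matches("\\p{Zs}", "\u00A0"));       // true  (NO-BREAK SPACE is Zs)
        System.out.println(matches("\\p{Zs}", lineSeparator));  // false (Zl is not Zs)
        System.out.println(matches("\\p{Z}", lineSeparator));   // true  (Z covers Zs, Zl, Zp)
        System.out.println(matches("\\p{Z}", "\t"));            // false (tab is a control character)
        System.out.println(matches("[\\p{Z}\\s]", "\t"));       // true  (\s covers tab)
    }
}
```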

You should really clarify your question because it has misled a lot of people (and even earned the correct answer some downvotes).

Vinko Vrsalovic answered Sep 21 '22 17:09