Whitespace Matching Regex - Java

People also ask

How do you match a space in regex?

If you're looking for a space, that would be " " (one space). If you're looking for one or more, it's " *" (that's two spaces and an asterisk) or " +" (one space and a plus).

What is the regex for space in Java?

Therefore, the regular expression \s matches a single whitespace character, while \s+ will match one or more whitespace characters.

What does \\ s+ mean in regex?

The Java regex pattern \\s+ is used to match multiple whitespace characters when applying a regex search to your specified value. The pattern is a modified version of \\s which is used to match a single whitespace character. The difference is easy to see with an example.

What is the use of \\ s+ in Java?

\\s - matches single whitespace character. \\s+ - matches sequence of one or more whitespace characters.

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.

Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-comforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);

For Java (not php, not javascript, not anyother):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")

when I sended a question to a Regexbuddy (regex developer application) forum, I got more exact reply to my \s Java question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
String result = "";
if (matcher.find()) {
    result = matcher.replaceAll(" ");
}

System.out.println(result);

Java has evolved since this issue was first brought up. You can match all manner of unicode space characters by using the \p{Zs} group.

Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

Also worth knowing, if you've used the trim() string function you should take a look at the (relatively new) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of squirrely white space characters. For more information on what what space is included, see Java's Character.isWhitespace() function.

Related questions
                            
                                Catching java.lang.OutOfMemoryError?
                            
                                Amazon S3 upload file and get URL
                            
                                Is it possible to view bytecode of Class file? [duplicate]
                            
                                HQL ERROR: Path expected for join
                            
                                Maven build failed: "Unable to locate the Javac Compiler in: jre or jdk issue"
                            
                                How can I count the number of matches for a regex?
                            
                                javax.xml.bind.UnmarshalException: unexpected element (uri:"", local:"Group")
                            
                                How to assign bean's property an Enum value in Spring config file?
                            
                                java.util.Objects.isNull vs object == null
                            
                                Converting many 'if else' statements to a cleaner approach [duplicate]
                            
                                Convert float to double without losing precision
                            
                                Eclipse: Error ".. overlaps the location of another project.." when trying to create new project
                            
                                Comparing boxed Long values 127 and 128
                            
                                Spring Boot - inject map from application.yml
                            
                                Download and open PDF file using Ajax
                            
                                Populating a ListView using an ArrayList?
                            
                                What is float in Java?
                            
                                Group by multiple field names in java 8
                            
                                Java boolean getters "is" vs "are"
                            
                                Transaction marked as rollback only: How do I find the cause

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Whitespace Matching Regex - Java

Tags:

java

regex

whitespace

People also ask

Recent Activity

Donate For Us