
Why is non-breaking space not a whitespace character in Java?

Tags: java, unicode

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
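The documented exclusion is easy to observe by comparing Character.isWhitespace with Character.isSpaceChar for a few code points. The following is only a minimal sketch; the class name NbspWhitespaceDemo and the helper classify are hypothetical names chosen for illustration.

```java
public class NbspWhitespaceDemo {
    // Hypothetical helper: reports how the Character class classifies one code point.
    public static String classify(int codePoint) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                codePoint,
                Character.isWhitespace(codePoint),
                Character.isSpaceChar(codePoint));
    }

    public static void main(String[] args) {
        // Space, tab, the three non-breaking spaces named in the Javadoc, and EM SPACE.
        int[] samples = {' ', '\t', 0x00A0, 0x2007, 0x202F, 0x2003};
        for (int cp : samples) {
            System.out.println(classify(cp));
        }
    }
}
```

For U+00A0 this prints "U+00A0 isWhitespace=false isSpaceChar=true": the character is a Unicode space separator, yet isWhitespace rejects it.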

Why is that?

The corresponding .NET implementation is less discriminating.

Asked by Palimondo, Jun 29 '09 21:06




1 Answer

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and make Character.isWhitespace(char) return true for non-breaking spaces, they can't, because software almost certainly exists that relies on the current implementation working exactly the way it does.
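Since the method's behavior is unlikely to change, code that wants non-breaking spaces trimmed has to handle them itself. One possible approach, offered only as a sketch and not as an official recommendation, is to treat a character as trimmable when either Character.isWhitespace or Character.isSpaceChar accepts it. The names NbspTrim, isTrimmable and trimUnicode are hypothetical.

```java
public class NbspTrim {
    // Assumption: "trimmable" means anything Character.isWhitespace accepts,
    // plus the Unicode space separators it excludes (U+00A0, U+2007, U+202F).
    public static boolean isTrimmable(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Removes trimmable code points from both ends; interior characters are kept.
    public static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isTrimmable(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isTrimmable(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String parsed = "\u00A0 price:\u00A042\u202F\t";
        // String.trim() would leave the leading U+00A0 in place.
        System.out.println("[" + trimUnicode(parsed) + "]");
    }
}
```

The non-breaking space between "price:" and "42" survives, because only the ends of the string are examined.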

Answered by Steve McLeod, Oct 13 '22 23:10