While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.
That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
Why is that?
The implementation of the corresponding .NET equivalent is less discriminating.
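The behaviour described above can be checked with a small sketch. The class name NbspWhitespaceDemo and the sample string are made up for illustration; the calls themselves (String.trim(), Character.isWhitespace(char), Character.isSpaceChar(char)) are standard JDK methods.

```java
public class NbspWhitespaceDemo {
    // The three non-breaking space characters named in the Javadoc quote.
    public static final char[] NON_BREAKING = {'\u00A0', '\u2007', '\u202F'};

    public static void main(String[] args) {
        for (char c : NON_BREAKING) {
            // isWhitespace excludes these characters; isSpaceChar only checks
            // the Unicode category and therefore accepts them.
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    (int) c, Character.isWhitespace(c), Character.isSpaceChar(c));
        }

        // String.trim() only strips characters up to U+0020, so the
        // surrounding non-breaking spaces survive and the length stays 6.
        String padded = "\u00A0text\u00A0";
        System.out.println(padded.trim().length());
    }
}
```

Each of the three characters should report isWhitespace=false and isSpaceChar=true, and the trimmed sample keeps its original length of 6.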
A nonbreaking space is the same width as a word space, but it prevents the text from flowing to a new line or page.
A character in Java can be considered as a whitespace character if one of the following criteria is satisfied: The character is a Unicode space character (either a SPACE_SEPARATOR, or a LINE_SEPARATOR, or a PARAGRAPH_SEPARATOR) but it must not be a non-breaking space.
While &nbsp; is a non-breaking space (a space that does not want to be treated as whitespace), you can trim a string while preserving every &nbsp; within the string with a simple regex: string.replaceAll("(^\\h*)|(\\h*$)",""). Here \h is a horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
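A minimal sketch of that regex approach, wrapped in a reusable method. The class name HorizontalTrim, the method trimHorizontal, and the sample input are hypothetical, and the pattern uses + instead of * only to avoid empty matches; it assumes Java 8 or later, where \h is available in java.util.regex.

```java
import java.util.regex.Pattern;

public class HorizontalTrim {
    // \h matches horizontal whitespace, including U+00A0 (non-breaking space).
    // Compiled once so repeated calls do not re-parse the pattern.
    private static final Pattern EDGES = Pattern.compile("^\\h+|\\h+$");

    // Removes horizontal whitespace at both ends; interior characters,
    // including interior non-breaking spaces, are left untouched.
    static String trimHorizontal(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String parsed = "\u00A0 price:\u00A0100 \u00A0";
        System.out.println("[" + trimHorizontal(parsed) + "]");
    }
}
```

Note that \h does not cover line terminators, so leading or trailing newlines are left in place; a different character class would be needed if those should go too.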
The \S metacharacter matches non-whitespace characters. Whitespace characters can be: A space character.
Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.
Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.
Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters that returns true from Character.isWhitespace(char), they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.
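Since the JDK method itself cannot change, one possible workaround is to build the broader predicate in your own code. The following is only a sketch under that assumption: UnicodeTrim, isTrimmable, and trimUnicode are hypothetical names, not part of the JDK, and combining the two Character methods is just one way to decide what counts as trimmable.

```java
public class UnicodeTrim {
    // Hypothetical predicate: accepts everything isWhitespace accepts, plus
    // the Unicode space characters it excludes (the non-breaking spaces).
    static boolean isTrimmable(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Trims both ends using the predicate above; the interior is untouched.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isTrimmable(s.charAt(start))) {
            start++;
        }
        while (end > start && isTrimmable(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimUnicode("\u00A0\t hi\u00A0there \u202F") + "]");
    }
}
```

Existing callers of Character.isWhitespace(char) are unaffected, which is the point: the stricter behaviour lives only in code that opts in.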