
All the Whitespace Characters? Is it language independent?

I was wondering whether all languages treat the same set of characters as whitespace characters, or whether there is any variation.

Can anyone provide a complete list of whitespace characters, separating out the ones that can be entered from a keyboard? If the set differs between languages, the differences and the reasons for them would be even more helpful. Any language is fine, as long as you don't bring up Whitespace or its variants (if any). I certainly don't want a complete list for a language like Whitespace :)

sakibmoon asked Aug 11 '13 05:08


2 Answers

Whether a particular character is categorized as whitespace depends on the character set in use. That said, a programming language is free to make its own definition of what constitutes whitespace.

Most modern languages use the Unicode character set, which does have a definition for space separator characters: any character whose general category is Zs is a space separator.

You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.
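For a quick look without downloading the database, here is a minimal sketch in Python that scans every code point and keeps those whose general category is Zs. It relies on the standard `unicodedata` module, so the result reflects whichever Unicode version your Python build ships with; the helper name `space_separators` is made up for this illustration.

```python
import sys
import unicodedata


def space_separators():
    """Return all characters whose Unicode general category is Zs."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


if __name__ == "__main__":
    # The Unicode version bundled with this Python build
    print("Unicode version:", unicodedata.unidata_version)
    for ch in space_separators():
        # unicodedata.name() returns the official character name
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Because the list is computed rather than hard-coded, it follows the database as new Unicode versions are adopted.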

In addition to general categories such as Zs, Unicode also defines character properties, and one of them is White_Space. As of Unicode 7.0, the characters with this property are all of the characters in category Zs, the line and paragraph separators U+2028 and U+2029, and a few control characters (U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085). You can find all of the characters with the White_Space property at Unicode.org here.

Many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:]. Beware: whether these cover Unicode whitespace depends on the language and on the flags in use, and in many implementations they refer only to certain characters from the ASCII set, generally these:

  • SPACE (codepoint 32, U+0020)
  • TAB (codepoint 9, U+0009)
  • LINE FEED (codepoint 10, U+000A)
  • LINE TABULATION (codepoint 11, U+000B)
  • FORM FEED (codepoint 12, U+000C)
  • CARRIAGE RETURN (codepoint 13, U+000D)
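What `\s` matches is worth checking in your own environment rather than assuming. As one data point, here is a small sketch using Python's `re` module: for `str` patterns `\s` is Unicode-aware by default, and the `re.ASCII` flag narrows it to the six characters above. The names `CLASSIC`, `NBSP`, and `matches` are just for this illustration.

```python
import re

# The six "classic" ASCII whitespace characters listed above
CLASSIC = " \t\n\x0b\x0c\r"
# NO-BREAK SPACE (U+00A0), a Zs character outside ASCII
NBSP = "\u00a0"

unicode_ws = re.compile(r"\s")          # default for str patterns: Unicode-aware
ascii_ws = re.compile(r"\s", re.ASCII)  # restricted to [ \t\n\r\f\v]


def matches(pattern, ch):
    """Return True if the single character ch is matched by pattern."""
    return pattern.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in CLASSIC + NBSP:
        print(f"U+{ord(ch):04X}",
              "unicode:", matches(unicode_ws, ch),
              "ascii:", matches(ascii_ws, ch))
```

Other regex engines make different choices, so treat this as a check for Python only.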

This list is interesting because only SPACE is a space separator (Zs); the other five belong to the "Other, Control" category (Cc). This is what a programming language generally means when it uses the term "whitespace."

So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace," it is probably the six characters listed above. If you want something more "modern," then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look outside that category, too (e.g., U+1361 ETHIOPIC WORDSPACE, as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.
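The "six classics plus Zs" definition can be sketched as a small predicate. This is only one possible reading of "whitespace", written in Python with the standard `unicodedata` module; `CLASSIC_WHITESPACE` and `is_whitespace` are hypothetical names, not part of any library.

```python
import unicodedata

# The six "classic" whitespace characters: space, tab, LF, VT, FF, CR
CLASSIC_WHITESPACE = frozenset(" \t\n\x0b\x0c\r")


def is_whitespace(ch):
    """True if ch is one of the six classics or has general category Zs."""
    return ch in CLASSIC_WHITESPACE or unicodedata.category(ch) == "Zs"


if __name__ == "__main__":
    for ch in ["\t", " ", "\u00a0", "\u3000", "\u1361", "a"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_whitespace(ch))
```

Under this definition U+1361 is not whitespace, because its general category is not Zs. If your use case needs it, you would have to add it to the set explicitly.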

Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.

Ray Toal answered Sep 25 '22 22:09


There are currently 25 Unicode whitespace characters with the following hexadecimal 'code points':

9, A, B, C, D, 20, 85, A0, 1680, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F, 3000 

Corresponding decimal values are:

9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288 
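If you want to double-check the two lists against each other, a short Python sketch like the following converts the hexadecimal values and compares them with the decimal ones. It also prints each character's general category from the standard `unicodedata` module. The constant names are made up for this example.

```python
import unicodedata

# The two lists exactly as given above
HEX_CODE_POINTS = (
    "9, A, B, C, D, 20, 85, A0, 1680, 2000, 2001, 2002, 2003, 2004, "
    "2005, 2006, 2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F, 3000"
)
DECIMAL_CODE_POINTS = [
    9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196,
    8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288,
]

# Parse each hexadecimal string into an integer code point
code_points = [int(h, 16) for h in HEX_CODE_POINTS.split(", ")]

if __name__ == "__main__":
    print("lists agree:", code_points == DECIMAL_CODE_POINTS)
    for cp in code_points:
        ch = chr(cp)
        # Control characters have no name, so supply a placeholder
        name = unicodedata.name(ch, "<control>")
        print(f"U+{cp:04X}  {cp:>5}  {unicodedata.category(ch)}  {name}")
```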

I originally got this information from Unicode.org, but my old link no longer works. Wikipedia has a nice page on the subject, though, at https://en.wikipedia.org/wiki/Whitespace_character if anyone is interested, and it also gives 25 characters. (I have not cross-checked that they are the same characters, but I trust that the Unicode Consortium has not made such a major, breaking change to its character set!)

I did find one simple page on Unicode's website today, but it looks more like a draft HTML page than anything claiming an official stance. It does, however, match what Unicode had previously posted as its official list of whitespace characters. (The link is in my comment below my answer.)

Shawn Kovac answered Sep 23 '22 22:09