
What are all the Japanese whitespace characters?

I need to split a string and extract words separated by whitespace characters. The source may be in English or Japanese. English whitespace characters include tab and space, and Japanese text uses these too. (IIRC, all widely-used Japanese character sets are supersets of US-ASCII.)

So the set of characters I need to use to split my string includes normal ASCII space and tab.

But, in Japanese, there is another space character, commonly called a 'full-width space'. According to my Mac's Character Viewer utility, this is U+3000 "IDEOGRAPHIC SPACE". This is (usually) what results when a user presses the space bar while typing in Japanese input mode.

Are there any other characters that I need to consider?

I am processing textual data submitted by users who have been told to "separate entries with spaces". However, the users are using a wide variety of computer and mobile phone operating systems to submit these texts. We've already seen that users may not be aware of whether they are in Japanese or English input mode when entering this data.

Furthermore, the behavior of the space key differs across platforms and applications even in Japanese mode (e.g., Windows 7 will insert an ideographic space but iOS will insert an ASCII space).

So what I want is basically "the set of all characters that visually look like a space and might be generated when the user presses the space key, or the tab key since many users do not know the difference between a space and a tab, in Japanese and/or English".

Is there any authoritative answer to such a question?

Mason asked Nov 29 '10 05:11




1 Answer

You need the ASCII tab and space, the non-breaking space (U+00A0), and the full-width space, which you've correctly identified as U+3000. You may also want newlines and vertical whitespace characters. If your input is in Unicode (not Shift-JIS, etc.), then that's all you'll need. There are other (control) characters, such as \0 NULL, which are sometimes used as information delimiters, but they won't be rendered as a space in East Asian text, i.e., they won't appear as whitespace.
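As a rough illustration of the character set listed above, here is a minimal Python 3 sketch. The names `SEPARATORS` and `split_entries` are made up for this example, and the set deliberately leaves out newlines and vertical whitespace; add them to the character class if your input can contain them.

```python
import re

# Assumed separator set, taken from the list above: tab, ASCII space,
# non-breaking space (U+00A0) and ideographic space (U+3000).
SEPARATORS = re.compile("[\t \u00a0\u3000]+")


def split_entries(text):
    """Split text on runs of separator characters, dropping empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]


# Mixed ASCII and full-width separators in one string.
entries = split_entries("東京\u3000大阪 kyoto\tnara\u00a0kobe")
```

Matching runs of separators (the trailing `+`) means a user who types two spaces in a row, or mixes a full-width and an ASCII space, does not produce empty entries.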

Edit: Matt Ball has a good point in his comment, but, as his example illustrates, many regex implementations don't deal well with full-width East Asian punctuation. In this connection, it's worth mentioning that Python's string.whitespace won't work either: it contains only the ASCII whitespace characters, so it does not include U+3000.
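To make the string.whitespace point concrete, here is a small check, assuming Python 3 (where `str` is Unicode). The variable names are illustrative only. It also shows that the built-in `str.isspace()` and argument-less `str.split()` do treat U+3000 as whitespace, which may be enough if you are happy to split on every Unicode whitespace character rather than a hand-picked set.

```python
import string

IDEOGRAPHIC_SPACE = "\u3000"

# string.whitespace is a fixed ASCII-only string, so this is False.
in_ascii_set = IDEOGRAPHIC_SPACE in string.whitespace

# str.isspace() consults the Unicode database, so this is True.
is_unicode_space = IDEOGRAPHIC_SPACE.isspace()

# str.split() with no arguments splits on runs of Unicode whitespace.
words = "東京\u3000大阪 京都".split()
```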

simon answered Oct 06 '22 00:10