
What is a minimal set of Unicode characters for reasonable Japanese support?

I have a mobile application that needs to be ported for a Japanese audience. Part of the application is a custom font file that needs to be extended from containing only Latin-1 characters to also containing Japanese characters. I realise that this will make it rather large, but that is not today's problem.

Note that I have no control over the text displayed by this application, so the font needs to cover enough characters to display user-generated content.

Here is what I believe to be a maximal set of Unicode ranges that would cover anything required of it.

 Compatibility                         U+3300  -  U+33FF
 Compatibility forms                   U+FE30  -  U+FE4F
 Compatibility ideographs              U+F900  -  U+FAFF
 Compatibility ideographs supplement  U+2F800  - U+2FA1F
 Radicals supplement                   U+2E80  -  U+2EFF
 Strokes                               U+31C0  -  U+31EF
 Symbols and punctuation               U+3000  -  U+303F
 Unified Ideographs                    U+4E00  -  U+9FBB
 Unified Ideographs ext. A             U+3400  -  U+4DB5
 Unified Ideographs ext. B            U+20000  - U+2A6D6
 Enclosed letters and months           U+3200  -  U+32FF
 Hiragana                              U+3040  -  U+309F
 Kanbun                                U+3190  -  U+319F
 Katakana                              U+30A0  -  U+30FF
 Katakana phonetic                     U+31F0  -  U+31FF

What I need to know is:

  • Is anything missing from this list?
  • Is anything obviously not required?
  • Is anything arguably non-essential, and why could it be argued as such?
izb asked Apr 03 '09 10:04



1 Answer

Summary of Essential Characters

Enclosed Alphanumerics                U+2460  -  U+2473
            "                         U+2474  -  U+24E9*
            "                         U+24EA  -  U+24FF
Miscellaneous Symbols                 U+2600  -  U+2607
            "                         U+260E  -  U+260F
            "                         U+2614  -  U+2615
            "                         U+2618  -  U+2618
            "                         U+263D  -  U+2653
            "                         U+2660  -  U+266F
Symbols and punctuation               U+3000  -  U+303F
Hiragana                              U+3040  -  U+309F
Katakana                              U+30A0  -  U+30FF
Katakana phonetic                     U+31F0  -  U+31FF
Enclosed letters and months           U+321F  -  U+325F*
            "                         U+3280  -  U+32FF*
Unified Ideographs ext. A             U+3400  -  U+4DB5
Unified Ideographs                    U+4E00  -  U+9FBB
Compatibility ideographs              U+F900  -  U+FAFF
Compatibility forms                   U+FE30  -  U+FE4F
Full-Width Roman                      U+FF00  -  U+FF5E
Half-Width Katakana                   U+FF61  -  U+FF9F
Full- and Half-Width Symbols          U+FFE0  -  U+FFEE
Unified Ideographs ext. B            U+20000  - U+2A6D6
Compatibility ideographs supplement  U+2F800  - U+2FA1F

* = Lower priority
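
As a rough way to check sample text against this table, the ranges can be written down as data and tested per character. This is only a sketch: the names `LATIN_1`, `ESSENTIAL`, `LOWER_PRIORITY`, `is_covered`, `missing_chars` and `total_code_points` are made up for illustration, and the Latin-1 entry assumes the existing font covers U+0000 to U+00FF as described in the question.

```python
# Inclusive (start, end) code point ranges, copied from the summary table.
# Assumption: the existing font already covers Latin-1 (U+0000 - U+00FF).
LATIN_1 = [(0x0000, 0x00FF)]

ESSENTIAL = [
    (0x2460, 0x2473), (0x24EA, 0x24FF),                      # Enclosed Alphanumerics
    (0x2600, 0x2607), (0x260E, 0x260F), (0x2614, 0x2615),    # Miscellaneous Symbols
    (0x2618, 0x2618), (0x263D, 0x2653), (0x2660, 0x266F),
    (0x3000, 0x303F),    # Symbols and punctuation
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x31F0, 0x31FF),    # Katakana phonetic
    (0x3400, 0x4DB5),    # Unified Ideographs ext. A
    (0x4E00, 0x9FBB),    # Unified Ideographs
    (0xF900, 0xFAFF),    # Compatibility ideographs
    (0xFE30, 0xFE4F),    # Compatibility forms
    (0xFF00, 0xFF5E),    # Full-width Roman
    (0xFF61, 0xFF9F),    # Half-width Katakana
    (0xFFE0, 0xFFEE),    # Full- and half-width symbols
    (0x20000, 0x2A6D6),  # Unified Ideographs ext. B
    (0x2F800, 0x2FA1F),  # Compatibility ideographs supplement
]

# Rows marked with * in the table.
LOWER_PRIORITY = [(0x2474, 0x24E9), (0x321F, 0x325F), (0x3280, 0x32FF)]

FONT_RANGES = LATIN_1 + ESSENTIAL + LOWER_PRIORITY


def is_covered(ch, ranges):
    """Return True if the single character ch falls inside any range."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ranges)


def missing_chars(text, ranges):
    """Return the distinct characters of text that no range covers, sorted by code point."""
    return sorted({ch for ch in text if not is_covered(ch, ranges)})


def total_code_points(ranges):
    """Count the code points spanned by the ranges (assigned or not)."""
    return sum(end - start + 1 for start, end in ranges)


if __name__ == "__main__":
    print(missing_chars("日本語のテスト①", FONT_RANGES))
    print(total_code_points(ESSENTIAL))
```

Running `missing_chars` over a corpus of real user-generated text is one way to see which ranges actually matter in practice, and `total_code_points` gives an upper bound on how many glyphs each choice adds to the font.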

Full Explanation

Don't forget the full-width Roman page (FF00-FF5E), which is often used for the Roman alphabet in Japanese text, and the half-width Katakana page (FF61-FF9F). You will probably also need the full- and half-width symbols (FFE0-FFEE).
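
To see why these pages need their own glyphs, it can help to inspect a few characters: the full-width and half-width forms are separate code points from their ordinary counterparts. The sketch below uses Python's standard `unicodedata` module; the helper name `describe_width` is made up for illustration.

```python
import unicodedata


def describe_width(ch):
    """Return (code point, East Asian Width class, NFKC-folded form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.east_asian_width(ch),   # 'F' = fullwidth, 'H' = halfwidth
        unicodedata.normalize("NFKC", ch),  # compatibility folding to the ordinary form
    )


if __name__ == "__main__":
    # Full-width A, ASCII A, half-width KA, ordinary KA
    for ch in "\uff21A\uff76\u30ab":
        print(ch, describe_width(ch))
```

NFKC folding maps full-width Roman to ASCII and half-width Katakana to ordinary Katakana, so it could serve as a fallback if those pages were left out of the font, but it changes the user's text, so including the glyphs is the safer choice.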

An argument can be made that the Kanbun annotation page (3190-319F) will generally not be used. Kanbun is an old style of Japanese writing that uses only Chinese characters (no Hiragana or Katakana) with a different set of grammar rules; it is generally taught at school. These annotation marks will not be used unless someone is trying to explain how to read or understand one of these passages, which is unlikely. The page could be included for completeness, but it is not a high priority.

CJK Compatibility (3300-33FF) is generally used by newspapers in print media, but will almost certainly not be used by the general public (I have yet to see one of these characters on a website). In any case, these characters have equivalent long forms (e.g. ㌘ can be written as グラム instead), so this page is also in the non-essential category.
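
If this page is left out of the font, one possible fallback is to expand its characters to their long forms before display. The sketch below relies on the compatibility decompositions that Unicode defines for these characters, via NFKC normalization in Python's standard `unicodedata` module. The function name `expand_cjk_compat` is hypothetical, and applying the expansion only to this one page is a design choice made for this example, not something the answer prescribes.

```python
import unicodedata

# The CJK Compatibility page discussed above (inclusive bounds).
CJK_COMPAT_START, CJK_COMPAT_END = 0x3300, 0x33FF


def expand_cjk_compat(text):
    """Replace characters in U+3300 - U+33FF with their NFKC long form.

    Characters outside that page are passed through unchanged, so
    full-width Roman, half-width Katakana, etc. are left alone.
    """
    out = []
    for ch in text:
        if CJK_COMPAT_START <= ord(ch) <= CJK_COMPAT_END:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # U+3318 is the squared "gram" character from the example above.
    print(expand_cjk_compat("5\u3318"))
```

The expanded text is longer than the original, so any layout that assumed a single character cell would need to allow for that.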

CJK Radicals Supplement (2E80-2EFF) is also non-essential, but could be used. These are not complete characters, but the "radicals" (base parts) of characters. They could be used to explain the derivation of a character, but are unlikely to appear in normal use of the language.

CJK Strokes (31C0-31E3) falls in the same category as the CJK Radicals Supplement, and is probably even less likely to be used in everyday writing.

The first part of Enclosed CJK Letters and Months (3200-321E) is unnecessary, because those are Korean symbols. The same goes for 3260-327F. The rest of the page has a low usage rate, but I would include it for completeness because someone will probably try to use one of these characters occasionally. You can consider them lower priority.

The rest of the ranges in your original list are essential.

Also missing from the list is Enclosed Alphanumerics (2460-24FF). The circled numbers (2460-2473 and 24EA-24FF) are used relatively frequently. The circled alphabet, parenthesized numbers, and numbers followed by a period (2474-24E9) could be omitted as non-essential, however.

Also, you would do well to include Miscellaneous Symbols (2600-26FF), although some are used more often than others. Absolutely essential ones include some of the weather symbols (2600-2607), the telephones (260E-260F), umbrella and hot drink (2614-2615), shamrock (2618), astrological and zodiac symbols (263D-2653), and playing card suits, hot springs, and musical symbols (2660-266F).

lc. answered Nov 14 '22 03:11
