Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I relate Unicode blocks to Languages/Scripts?

I am trying to find a resource that can be used to connect Languages (or more probably Scripts) to blocks of Unicode characters. Such a resource would be used to lookup questions such as "What Unicode Blocks are used in French?" or "What languages use the block from 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?

I would have expected to be able to find this information easily at unicode.org. I was quickly able to find a great table that relates Country Codes to Languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html). But I've spent quite a bit of time poking around with no luck finding something that relates Unicode Blocks to Languages. Its possible I've got a terminology issue blocking me from connecting the dots here...

I am not picky about exactly what is meant by "language" (Java Locale code or ISO 639 code or whatever) in this case. I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever), I can easily transform this into data I can use for my application. And again, I do realize the reference would probably connect Scripts to Blocks, not Languages (though Scripts can be mapped to Languages).

I do realize this will be a many-to-many table (since many languages use characters from multiple blocks, and many blocks are used by multiple languages); I do realize this cannot be precisely answered since Unicode codepoints are not language specific -- however, neither can the question of "what languages are there in this country" (answer is probably "most of them" for most countries), yet a table like this (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful and useful.

As to why I'd want such a thing: I would like to enhance http://unicodinator.com with global heat-maps for the code blocks, and lists of languages; I also have a game concept I am tinkering with. Beyond that, there are probably many other uses other people could have for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).

like image 886
jwl Avatar asked Jun 21 '11 22:06

jwl


1 Answers

I got an answer from Unicode.org themselves! In the CLDR subproject, there are documents such as:

  • http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml
  • http://unicode.org/cldr/trac/browser/trunk/common/main/fr.xml

for each language id, which you can search for "exemplarCharacters":

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
<exemplarCharacters type="currencySymbol" draft="contributed">[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="index" draft="contributed">[ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي]</exemplarCharacters>

Or, there is this page: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html with what looks like all of them. I will work on reshuffling this data into a langid -> blockid map of some kind, at which I will probably aware @borrible the "Answer" (rather than make mine the answer).

like image 164
jwl Avatar answered Sep 18 '22 21:09

jwl