Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking up unicode character set of language in JS

How can I find information about a Unicode character(e.g. character set it belongs to) in Java script ?

E.g.

00e9  LATIN SMALL LETTER E WITH ACUTE
0bf2  TAMIL NUMBER ONE THOUSAND

I am aware of a way to find details about a Unicode code point in python, using theunicodedata library. Is there a way to find out this information in JS?

PS: I am using this for chrome extension development, so a solution using their APIs is also good.

like image 721
varrunr Avatar asked Feb 20 '23 13:02

varrunr


1 Answers

English-language text is dominated by code points from the Latin, Common, and Inherited scripts, and in some corpora, also Greek.

For example, the PubMed Open Access collection, which is a very large collection of all English-language text, is filled with non-ASCII code points. Fully 90% of these are accounted for by only 36 distinct code points, as follows:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
   1  18.553%  18.553%  U+02013 ‹–›  GC=Pd    EN DASH
   2   7.422%  25.974%  U+000A0 ‹ ›  GC=Zs    NO-BREAK SPACE
   3   7.033%  33.007%  U+000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN
   4   5.461%  38.469%  U+02212 ‹−›  GC=Sm    MINUS SIGN
   5   4.196%  42.664%  U+02003 ‹ ›  GC=Zs    EM SPACE
   6   3.682%  46.346%  U+003BC ‹μ›  GC=Ll    GREEK SMALL LETTER MU
   7   3.619%  49.965%  U+003B2 ‹β›  GC=Ll    GREEK SMALL LETTER BETA
   8   3.568%  53.534%  U+003B1 ‹α›  GC=Ll    GREEK SMALL LETTER ALPHA
   9   3.426%  56.959%  U+0200A ‹ ›  GC=Zs    HAIR SPACE
  10   3.221%  60.181%  U+000B0 ‹°›  GC=So    DEGREE SIGN
  11   2.931%  63.112%  U+02009 ‹ ›  GC=Zs    THIN SPACE
  12   2.620%  65.732%  U+02019 ‹’›  GC=Pf    RIGHT SINGLE QUOTATION MARK
  13   2.506%  68.238%  U+02032 ‹′›  GC=Po    PRIME
  14   2.441%  70.679%  U+000D7 ‹×›  GC=Sm    MULTIPLICATION SIGN
  15   2.042%  72.722%  U+0201D ‹”›  GC=Pf    RIGHT DOUBLE QUOTATION MARK
  16   2.039%  74.761%  U+0201C ‹“›  GC=Pi    LEFT DOUBLE QUOTATION MARK
  17   1.536%  76.296%  U+00394 ‹Δ›  GC=Lu    GREEK CAPITAL LETTER DELTA
  18   1.415%  77.712%  U+000B5 ‹µ›  GC=Ll    MICRO SIGN
  19   1.337%  79.049%  U+003B3 ‹γ›  GC=Ll    GREEK SMALL LETTER GAMMA
  20   1.210%  80.259%  U+000E9 ‹é›  GC=Ll    LATIN SMALL LETTER E WITH ACUTE
  21   1.152%  81.410%  U+02014 ‹—›  GC=Pd    EM DASH
  22   1.135%  82.546%  U+02018 ‹‘›  GC=Pi    LEFT SINGLE QUOTATION MARK
  23   0.998%  83.543%  U+000A9 ‹©›  GC=So    COPYRIGHT SIGN
  24   0.710%  84.253%  U+02265 ‹≥›  GC=Sm    GREATER-THAN OR EQUAL TO
  25   0.600%  84.853%  U+000F6 ‹ö›  GC=Ll    LATIN SMALL LETTER O WITH DIAERESIS
  26   0.599%  85.452%  U+000B7 ‹·›  GC=Po    MIDDLE DOT
  27   0.597%  86.049%  U+02022 ‹•›  GC=Po    BULLET
  28   0.594%  86.644%  U+0223C ‹∼›  GC=Sm    TILDE OPERATOR
  29   0.573%  87.217%  U+003BA ‹κ›  GC=Ll    GREEK SMALL LETTER KAPPA
  30   0.569%  87.785%  U+000FC ‹ü›  GC=Ll    LATIN SMALL LETTER U WITH DIAERESIS
  31   0.493%  88.278%  U+02264 ‹≤›  GC=Sm    LESS-THAN OR EQUAL TO
  32   0.440%  88.718%  U+000AE ‹®›  GC=So    REGISTERED SIGN
  33   0.433%  89.152%  U+000E4 ‹ä›  GC=Ll    LATIN SMALL LETTER A WITH DIAERESIS
  34   0.422%  89.573%  U+02020 ‹†›  GC=Po    DAGGER
  35   0.407%  89.980%  U+003B4 ‹δ›  GC=Ll    GREEK SMALL LETTER DELTA

One way to detect those would be to use the Unicode regular expression that says a character must either be from the Latin, Greek, Common, or Inherited scripts.

In this corpus, the top four comprise will over 99% of the code points. However, there are also a great many super-low-frequency code points in this dataset that fall outside those four scripts (e.g. Cyrillic, Han, Kana, Hangul, etc.). You would throw those out as false negatives if you restricted input to the four ultra-common scripts previously listed. There are 239 such distinct code points in this dataset, of which the top 50 most frequent are the following:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
 295   0.002%  99.828%  U+00424 ‹Ф›  GC=Lu    CYRILLIC CAPITAL LETTER EF
 381   0.001%  99.916%  U+0043A ‹к›  GC=Ll    CYRILLIC SMALL LETTER KA
 454   0.000%  99.949%  U+00413 ‹Г›  GC=Lu    CYRILLIC CAPITAL LETTER GHE
 491   0.000%  99.959%  U+0AD6D ‹국›  GC=Lo    HANGUL SYLLABLE GUG
 499   0.000%  99.961%  U+003EC ‹Ϭ›  GC=Lu    COPTIC CAPITAL LETTER SHIMA
 513   0.000%  99.965%  U+00406 ‹І›  GC=Lu    CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
 528   0.000%  99.968%  U+00416 ‹Ж›  GC=Lu    CYRILLIC CAPITAL LETTER ZHE
 534   0.000%  99.969%  U+00430 ‹а›  GC=Ll    CYRILLIC SMALL LETTER A
 539   0.000%  99.970%  U+0041F ‹П›  GC=Lu    CYRILLIC CAPITAL LETTER PE
 545   0.000%  99.971%  U+00421 ‹С›  GC=Lu    CYRILLIC CAPITAL LETTER ES
 553   0.000%  99.972%  U+0D55C ‹한›  GC=Lo    HANGUL SYLLABLE HAN
 555   0.000%  99.972%  U+00404 ‹Є›  GC=Lu    CYRILLIC CAPITAL LETTER UKRAINIAN IE
 566   0.000%  99.974%  U+0C5B4 ‹어›  GC=Lo    HANGUL SYLLABLE EO
 567   0.000%  99.974%  U+0041A ‹К›  GC=Lu    CYRILLIC CAPITAL LETTER KA
 568   0.000%  99.974%  U+0041B ‹Л›  GC=Lu    CYRILLIC CAPITAL LETTER EL
 571   0.000%  99.975%  U+0B2C8 ‹니›  GC=Lo    HANGUL SYLLABLE NI
 575   0.000%  99.975%  U+0AE4C ‹까›  GC=Lo    HANGUL SYLLABLE GGA
 578   0.000%  99.976%  U+00428 ‹Ш›  GC=Lu    CYRILLIC CAPITAL LETTER SHA
 579   0.000%  99.976%  U+00454 ‹є›  GC=Ll    CYRILLIC SMALL LETTER UKRAINIAN IE
 585   0.000%  99.977%  U+00418 ‹И›  GC=Lu    CYRILLIC CAPITAL LETTER I
 587   0.000%  99.977%  U+0B2E4 ‹다›  GC=Lo    HANGUL SYLLABLE DA
 600   0.000%  99.978%  U+00440 ‹р›  GC=Ll    CYRILLIC SMALL LETTER ER
 610   0.000%  99.980%  U+00457 ‹ї›  GC=Ll    CYRILLIC SMALL LETTER YI
 614   0.000%  99.980%  U+0C74C ‹음›  GC=Lo    HANGUL SYLLABLE EUM
 623   0.000%  99.981%  U+0BD80 ‹부›  GC=Lo    HANGUL SYLLABLE BU
 624   0.000%  99.981%  U+0C545 ‹악›  GC=Lo    HANGUL SYLLABLE AG
 625   0.000%  99.981%  U+0C778 ‹인›  GC=Lo    HANGUL SYLLABLE IN
 640   0.000%  99.982%  U+0C5D0 ‹에›  GC=Lo    HANGUL SYLLABLE E
 641   0.000%  99.983%  U+0C744 ‹을›  GC=Lo    HANGUL SYLLABLE EUL
 645   0.000%  99.983%  U+00438 ‹и›  GC=Ll    CYRILLIC SMALL LETTER I
 664   0.000%  99.984%  U+0041C ‹М›  GC=Lu    CYRILLIC CAPITAL LETTER EM
 665   0.000%  99.984%  U+00436 ‹ж›  GC=Ll    CYRILLIC SMALL LETTER ZHE
 674   0.000%  99.985%  U+0C774 ‹이›  GC=Lo    HANGUL SYLLABLE I
 678   0.000%  99.985%  U+00431 ‹б›  GC=Ll    CYRILLIC SMALL LETTER BE
 679   0.000%  99.986%  U+00435 ‹е›  GC=Ll    CYRILLIC SMALL LETTER IE
 689   0.000%  99.986%  U+0B300 ‹대›  GC=Lo    HANGUL SYLLABLE DAE
 690   0.000%  99.986%  U+0BD84 ‹분›  GC=Lo    HANGUL SYLLABLE BUN
 691   0.000%  99.986%  U+0C678 ‹외›  GC=Lo    HANGUL SYLLABLE OE
 696   0.000%  99.987%  U+005DB ‹כ›  GC=Lo    HEBREW LETTER KAF
 703   0.000%  99.987%  U+0B85C ‹로›  GC=Lo    HANGUL SYLLABLE RO
 711   0.000%  99.988%  U+0041D ‹Н›  GC=Lu    CYRILLIC CAPITAL LETTER EN
 712   0.000%  99.988%  U+004D9 ‹ә›  GC=Ll    CYRILLIC SMALL LETTER SCHWA
 725   0.000%  99.988%  U+0B294 ‹는›  GC=Lo    HANGUL SYLLABLE NEUN
 726   0.000%  99.988%  U+0B9CC ‹만›  GC=Lo    HANGUL SYLLABLE MAN
 727   0.000%  99.988%  U+0C11C ‹서›  GC=Lo    HANGUL SYLLABLE SEO
 728   0.000%  99.989%  U+0C2B5 ‹습›  GC=Lo    HANGUL SYLLABLE SEUB
 729   0.000%  99.989%  U+0C601 ‹영›  GC=Lo    HANGUL SYLLABLE YEONG
 741   0.000%  99.989%  U+00441 ‹с›  GC=Ll    CYRILLIC SMALL LETTER ES
 742   0.000%  99.989%  U+00444 ‹ф›  GC=Ll    CYRILLIC SMALL LETTER EF
 743   0.000%  99.989%  U+004B0 ‹Ұ›  GC=Lu    CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE

Of those 239 distinct trans-ASCII code points, 59 of them are also outside Unicode’s Basic Multilingual Plane, so any processing must be able to handle the full range of Unicode. All but one of these are mathematical letters. These are the top 20 of those:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
 227   0.004%  99.660%  U+1D49E ‹𝒞›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL C
 240   0.003%  99.704%  U+1D4AF ‹𝒯›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL T
 252   0.003%  99.738%  U+1D4AE ‹𝒮›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL S
 275   0.002%  99.791%  U+1D49F ‹𝒟›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL D
 279   0.002%  99.799%  U+1D4B3 ‹𝒳›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL X
 289   0.002%  99.818%  U+1D4A9 ‹𝒩›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL N
 291   0.002%  99.821%  U+1D4AB ‹𝒫›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL P
 292   0.002%  99.823%  U+1D4A2 ‹𝒢›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL G
 313   0.001%  99.854%  U+1D49C ‹𝒜›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL A
 316   0.001%  99.858%  U+1D53C ‹𝔼›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL E
 341   0.001%  99.884%  U+1D4AA ‹𝒪›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL O
 430   0.000%  99.941%  U+1D4A5 ‹𝒥›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL J
 450   0.000%  99.948%  U+1D4A6 ‹𝒦›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL K
 458   0.000%  99.950%  U+1D4B1 ‹𝒱›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL V
 461   0.000%  99.951%  U+1D4B2 ‹𝒲›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL W
 468   0.000%  99.953%  U+1D4B4 ‹𝒴›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Y
 469   0.000%  99.954%  U+1D4B5 ‹𝒵›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Z
 500   0.000%  99.962%  U+1D4B0 ‹𝒰›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL U
 518   0.000%  99.966%  U+1D4AC ‹𝒬›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Q
 560   0.000%  99.973%  U+1D54A ‹𝕊›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL S

Other corpora will vary. You have to know your dataset.

like image 194
tchrist Avatar answered Mar 06 '23 04:03

tchrist