How can I find information about a Unicode character (e.g. the character set it belongs to) in JavaScript?
E.g.
00e9 LATIN SMALL LETTER E WITH ACUTE
0bf2 TAMIL NUMBER ONE THOUSAND
I am aware of a way to find details about a Unicode code point in Python, using the unicodedata library. Is there a way to find out this information in JS?
PS: I am using this for Chrome extension development, so a solution using their APIs is also good.
English-language text is dominated by code points from the Latin, Common, and Inherited scripts, and in some corpora, also Greek.
For example, the PubMed Open Access collection, which is a very large collection of English-language text, is nonetheless filled with non-ASCII code points. Fully 90% of these are accounted for by only 36 distinct code points; the most frequent are as follows:
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
1 18.553% 18.553% U+02013 ‹–› GC=Pd EN DASH
2 7.422% 25.974% U+000A0 ‹ › GC=Zs NO-BREAK SPACE
3 7.033% 33.007% U+000B1 ‹±› GC=Sm PLUS-MINUS SIGN
4 5.461% 38.469% U+02212 ‹−› GC=Sm MINUS SIGN
5 4.196% 42.664% U+02003 ‹ › GC=Zs EM SPACE
6 3.682% 46.346% U+003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
7 3.619% 49.965% U+003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
8 3.568% 53.534% U+003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
9 3.426% 56.959% U+0200A ‹ › GC=Zs HAIR SPACE
10 3.221% 60.181% U+000B0 ‹°› GC=So DEGREE SIGN
11 2.931% 63.112% U+02009 ‹ › GC=Zs THIN SPACE
12 2.620% 65.732% U+02019 ‹’› GC=Pf RIGHT SINGLE QUOTATION MARK
13 2.506% 68.238% U+02032 ‹′› GC=Po PRIME
14 2.441% 70.679% U+000D7 ‹×› GC=Sm MULTIPLICATION SIGN
15 2.042% 72.722% U+0201D ‹”› GC=Pf RIGHT DOUBLE QUOTATION MARK
16 2.039% 74.761% U+0201C ‹“› GC=Pi LEFT DOUBLE QUOTATION MARK
17 1.536% 76.296% U+00394 ‹Δ› GC=Lu GREEK CAPITAL LETTER DELTA
18 1.415% 77.712% U+000B5 ‹µ› GC=Ll MICRO SIGN
19 1.337% 79.049% U+003B3 ‹γ› GC=Ll GREEK SMALL LETTER GAMMA
20 1.210% 80.259% U+000E9 ‹é› GC=Ll LATIN SMALL LETTER E WITH ACUTE
21 1.152% 81.410% U+02014 ‹—› GC=Pd EM DASH
22 1.135% 82.546% U+02018 ‹‘› GC=Pi LEFT SINGLE QUOTATION MARK
23 0.998% 83.543% U+000A9 ‹©› GC=So COPYRIGHT SIGN
24 0.710% 84.253% U+02265 ‹≥› GC=Sm GREATER-THAN OR EQUAL TO
25 0.600% 84.853% U+000F6 ‹ö› GC=Ll LATIN SMALL LETTER O WITH DIAERESIS
26 0.599% 85.452% U+000B7 ‹·› GC=Po MIDDLE DOT
27 0.597% 86.049% U+02022 ‹•› GC=Po BULLET
28 0.594% 86.644% U+0223C ‹∼› GC=Sm TILDE OPERATOR
29 0.573% 87.217% U+003BA ‹κ› GC=Ll GREEK SMALL LETTER KAPPA
30 0.569% 87.785% U+000FC ‹ü› GC=Ll LATIN SMALL LETTER U WITH DIAERESIS
31 0.493% 88.278% U+02264 ‹≤› GC=Sm LESS-THAN OR EQUAL TO
32 0.440% 88.718% U+000AE ‹®› GC=So REGISTERED SIGN
33 0.433% 89.152% U+000E4 ‹ä› GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
34 0.422% 89.573% U+02020 ‹†› GC=Po DAGGER
35 0.407% 89.980% U+003B4 ‹δ› GC=Ll GREEK SMALL LETTER DELTA
One way to detect those would be to use a Unicode regular expression that says a character must be from one of the Latin, Greek, Common, or Inherited scripts.
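In JavaScript, that check can be sketched with Unicode property escapes, assuming an engine that supports ES2018 `\p{...}` with the `u` flag (recent versions of Chrome do). The names `COMMON_SCRIPTS_ONLY` and `usesOnlyCommonScripts` are my own, chosen for illustration:

```javascript
// Matches strings made up only of code points whose Script property is
// Latin, Greek, Common, or Inherited. The `u` flag enables \p{...} escapes.
const COMMON_SCRIPTS_ONLY =
  /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]*$/u;

// Hypothetical helper: true when every code point is in one of the four scripts.
function usesOnlyCommonScripts(text) {
  return COMMON_SCRIPTS_ONLY.test(text);
}

console.log(usesOnlyCommonScripts("5 μm ± 2°, café")); // true
console.log(usesOnlyCommonScripts("Москва"));          // false (Cyrillic)
```

Note that digits, spaces, and most punctuation and symbols belong to the Common script, and combining marks such as accents are mostly Inherited, which is why those two have to be in the set alongside Latin and Greek.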
In this corpus, those four scripts comprise well over 99% of the code points. However, there are also a great many super-low-frequency code points in this dataset that fall outside those four scripts (e.g. Cyrillic, Han, Kana, Hangul, etc.). You would throw those out as false negatives if you restricted input to the four ultra-common scripts previously listed. There are 239 such distinct code points in this dataset, of which the top 50 most frequent are the following:
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
295 0.002% 99.828% U+00424 ‹Ф› GC=Lu CYRILLIC CAPITAL LETTER EF
381 0.001% 99.916% U+0043A ‹к› GC=Ll CYRILLIC SMALL LETTER KA
454 0.000% 99.949% U+00413 ‹Г› GC=Lu CYRILLIC CAPITAL LETTER GHE
491 0.000% 99.959% U+0AD6D ‹국› GC=Lo HANGUL SYLLABLE GUG
499 0.000% 99.961% U+003EC ‹Ϭ› GC=Lu COPTIC CAPITAL LETTER SHIMA
513 0.000% 99.965% U+00406 ‹І› GC=Lu CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
528 0.000% 99.968% U+00416 ‹Ж› GC=Lu CYRILLIC CAPITAL LETTER ZHE
534 0.000% 99.969% U+00430 ‹а› GC=Ll CYRILLIC SMALL LETTER A
539 0.000% 99.970% U+0041F ‹П› GC=Lu CYRILLIC CAPITAL LETTER PE
545 0.000% 99.971% U+00421 ‹С› GC=Lu CYRILLIC CAPITAL LETTER ES
553 0.000% 99.972% U+0D55C ‹한› GC=Lo HANGUL SYLLABLE HAN
555 0.000% 99.972% U+00404 ‹Є› GC=Lu CYRILLIC CAPITAL LETTER UKRAINIAN IE
566 0.000% 99.974% U+0C5B4 ‹어› GC=Lo HANGUL SYLLABLE EO
567 0.000% 99.974% U+0041A ‹К› GC=Lu CYRILLIC CAPITAL LETTER KA
568 0.000% 99.974% U+0041B ‹Л› GC=Lu CYRILLIC CAPITAL LETTER EL
571 0.000% 99.975% U+0B2C8 ‹니› GC=Lo HANGUL SYLLABLE NI
575 0.000% 99.975% U+0AE4C ‹까› GC=Lo HANGUL SYLLABLE GGA
578 0.000% 99.976% U+00428 ‹Ш› GC=Lu CYRILLIC CAPITAL LETTER SHA
579 0.000% 99.976% U+00454 ‹є› GC=Ll CYRILLIC SMALL LETTER UKRAINIAN IE
585 0.000% 99.977% U+00418 ‹И› GC=Lu CYRILLIC CAPITAL LETTER I
587 0.000% 99.977% U+0B2E4 ‹다› GC=Lo HANGUL SYLLABLE DA
600 0.000% 99.978% U+00440 ‹р› GC=Ll CYRILLIC SMALL LETTER ER
610 0.000% 99.980% U+00457 ‹ї› GC=Ll CYRILLIC SMALL LETTER YI
614 0.000% 99.980% U+0C74C ‹음› GC=Lo HANGUL SYLLABLE EUM
623 0.000% 99.981% U+0BD80 ‹부› GC=Lo HANGUL SYLLABLE BU
624 0.000% 99.981% U+0C545 ‹악› GC=Lo HANGUL SYLLABLE AG
625 0.000% 99.981% U+0C778 ‹인› GC=Lo HANGUL SYLLABLE IN
640 0.000% 99.982% U+0C5D0 ‹에› GC=Lo HANGUL SYLLABLE E
641 0.000% 99.983% U+0C744 ‹을› GC=Lo HANGUL SYLLABLE EUL
645 0.000% 99.983% U+00438 ‹и› GC=Ll CYRILLIC SMALL LETTER I
664 0.000% 99.984% U+0041C ‹М› GC=Lu CYRILLIC CAPITAL LETTER EM
665 0.000% 99.984% U+00436 ‹ж› GC=Ll CYRILLIC SMALL LETTER ZHE
674 0.000% 99.985% U+0C774 ‹이› GC=Lo HANGUL SYLLABLE I
678 0.000% 99.985% U+00431 ‹б› GC=Ll CYRILLIC SMALL LETTER BE
679 0.000% 99.986% U+00435 ‹е› GC=Ll CYRILLIC SMALL LETTER IE
689 0.000% 99.986% U+0B300 ‹대› GC=Lo HANGUL SYLLABLE DAE
690 0.000% 99.986% U+0BD84 ‹분› GC=Lo HANGUL SYLLABLE BUN
691 0.000% 99.986% U+0C678 ‹외› GC=Lo HANGUL SYLLABLE OE
696 0.000% 99.987% U+005DB ‹כ› GC=Lo HEBREW LETTER KAF
703 0.000% 99.987% U+0B85C ‹로› GC=Lo HANGUL SYLLABLE RO
711 0.000% 99.988% U+0041D ‹Н› GC=Lu CYRILLIC CAPITAL LETTER EN
712 0.000% 99.988% U+004D9 ‹ә› GC=Ll CYRILLIC SMALL LETTER SCHWA
725 0.000% 99.988% U+0B294 ‹는› GC=Lo HANGUL SYLLABLE NEUN
726 0.000% 99.988% U+0B9CC ‹만› GC=Lo HANGUL SYLLABLE MAN
727 0.000% 99.988% U+0C11C ‹서› GC=Lo HANGUL SYLLABLE SEO
728 0.000% 99.989% U+0C2B5 ‹습› GC=Lo HANGUL SYLLABLE SEUB
729 0.000% 99.989% U+0C601 ‹영› GC=Lo HANGUL SYLLABLE YEONG
741 0.000% 99.989% U+00441 ‹с› GC=Ll CYRILLIC SMALL LETTER ES
742 0.000% 99.989% U+00444 ‹ф› GC=Ll CYRILLIC SMALL LETTER EF
743 0.000% 99.989% U+004B0 ‹Ұ› GC=Lu CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
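To see which scripts such stray code points belong to in your own text, something like the following sketch may help. As far as I know JavaScript has no built-in that returns the script name of a code point, so this tests each one against a candidate list; the list and the names `scriptOf` and `tallyUnexpectedScripts` are assumptions of mine, and you would extend the list to fit your data:

```javascript
// Scripts treated as "expected" in English-language text.
const EXPECTED_SCRIPTS = new Set(["Latin", "Greek", "Common", "Inherited"]);

// Hypothetical candidate list: each code point is tested against these
// Script property values one at a time.
const CANDIDATE_SCRIPTS = [
  "Latin", "Greek", "Common", "Inherited",
  "Cyrillic", "Hangul", "Han", "Hiragana", "Katakana", "Hebrew", "Coptic",
];

const SCRIPT_TESTS = CANDIDATE_SCRIPTS.map((name) => [
  name,
  new RegExp(`^\\p{Script=${name}}$`, "u"),
]);

// Returns the first candidate script that matches, or "Other" if none does.
function scriptOf(ch) {
  for (const [name, re] of SCRIPT_TESTS) {
    if (re.test(ch)) return name;
  }
  return "Other";
}

// Counts code points per script, skipping the four expected scripts.
function tallyUnexpectedScripts(text) {
  const counts = new Map();
  for (const ch of text) { // for...of iterates by code point, not code unit
    const script = scriptOf(ch);
    if (EXPECTED_SCRIPTS.has(script)) continue;
    counts.set(script, (counts.get(script) || 0) + 1);
  }
  return counts;
}

// "See Фонд (한국어) for α-values", written with escapes
const tally = tallyUnexpectedScripts(
  "See \u0424\u043E\u043D\u0434 (\uD55C\uAD6D\uC5B4) for \u03B1-values"
);
console.log(tally); // Cyrillic: 4, Hangul: 3
```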
Of those 239 distinct trans-ASCII code points, 59 are also outside Unicode’s Basic Multilingual Plane, so any processing must be able to handle the full range of Unicode. All but one of these are mathematical letters. These are the top 20 of those:
rank percent cumulative code glyph GC=?? Name
---------------------------------------------------------------------
227 0.004% 99.660% U+1D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
240 0.003% 99.704% U+1D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
252 0.003% 99.738% U+1D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
275 0.002% 99.791% U+1D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
279 0.002% 99.799% U+1D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
289 0.002% 99.818% U+1D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
291 0.002% 99.821% U+1D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
292 0.002% 99.823% U+1D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
313 0.001% 99.854% U+1D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
316 0.001% 99.858% U+1D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
341 0.001% 99.884% U+1D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
430 0.000% 99.941% U+1D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
450 0.000% 99.948% U+1D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
458 0.000% 99.950% U+1D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
461 0.000% 99.951% U+1D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
468 0.000% 99.953% U+1D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
469 0.000% 99.954% U+1D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
500 0.000% 99.962% U+1D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
518 0.000% 99.966% U+1D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
560 0.000% 99.973% U+1D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
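In JavaScript this matters because strings are sequences of UTF-16 code units, so each of the characters above occupies two units. A minimal sketch of iterating by code point instead (the helper names `codePointsOf` and `formatCodePoint` are mine):

```javascript
// Array.from on a string iterates by code point, so a surrogate pair
// comes through as one element rather than two.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// Formats a code point as U+XXXX, padded to at least four hex digits.
function formatCodePoint(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// MATHEMATICAL SCRIPT CAPITAL C, " and ", MATHEMATICAL DOUBLE-STRUCK CAPITAL E
const sample = "\u{1D49E} and \u{1D53C}";

console.log(sample.length);               // 9 UTF-16 code units
console.log(codePointsOf(sample).length); // 7 code points

// Code points above U+FFFF lie outside the Basic Multilingual Plane.
const astral = codePointsOf(sample)
  .filter((cp) => cp > 0xffff)
  .map(formatCodePoint);
console.log(astral); // [ 'U+1D49E', 'U+1D53C' ]
```

Indexing with `sample[0]` or `charCodeAt` would return half of a surrogate pair here, which is why `codePointAt` and `for...of` style iteration are the safer choice.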
Other corpora will vary. You have to know your dataset.
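A rough way to get to know your dataset in JavaScript is to tally its non-ASCII code points along with their general category, similar to the `GC=` column in the tables above. This is only a sketch: `generalCategoryOf` and `tallyNonAscii` are hypothetical helpers, and the category is found by trying each two-letter value in turn because I know of no built-in that returns it directly. Character names are not available from the language itself; those would have to come from the Unicode data files or a library that bundles them.

```javascript
// The two-letter General_Category values defined by Unicode.
const GENERAL_CATEGORIES = [
  "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
];

const GC_TESTS = GENERAL_CATEGORIES.map((gc) => [
  gc,
  new RegExp(`^\\p{gc=${gc}}$`, "u"),
]);

// Returns the general category of a single code point, e.g. "Ll".
function generalCategoryOf(ch) {
  for (const [gc, re] of GC_TESTS) {
    if (re.test(ch)) return gc;
  }
  return "??";
}

// Counts non-ASCII code points and lists them, most frequent first.
function tallyNonAscii(text) {
  const counts = new Map();
  for (const ch of text) {
    if (ch.codePointAt(0) <= 0x7f) continue; // skip ASCII
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return [...counts]
    .sort((a, b) => b[1] - a[1])
    .map(([ch, count]) => ({
      code: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
      glyph: ch,
      gc: generalCategoryOf(ch),
      count,
    }));
}

// PLUS-MINUS SIGN twice, GREEK SMALL LETTER MU once, EN DASH once
const rows = tallyNonAscii("10 \u00B1 2 \u03BCm \u00B1 1 \u2013 x");
for (const row of rows) {
  console.log(row.count, row.code, row.glyph, "GC=" + row.gc);
}
```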