
What are the most common non-BMP Unicode characters in actual use? [closed]



Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!
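
To make "non-BMP" concrete: any code point above U+FFFF needs two 16-bit units (a surrogate pair) in UTF-16. Here is a minimal sketch of that arithmetic in Python 3; the helper name `surrogate_pair` is made up for illustration and is not from any library.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000      # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

face = chr(0x1F602)  # FACE WITH TEARS OF JOY
high, low = surrogate_pair(ord(face))
print(f"U+{ord(face):04X} -> {high:04X} {low:04X}")  # prints "U+1F602 -> D83D DE02"

# Python 3 strings are indexed by code point, so this is one character,
# but its UTF-16 encoding takes two 16-bit units.
utf16_units = len(face.encode("utf-16-be")) // 2
```

Languages whose strings are sequences of UTF-16 units will report a length of 2 for the same character.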


Excellent question!

The answer is the mathematical letters. This past December I did a scan of the entire PubMed Open Access corpus, and came up with these figures for astral characters in it.

The first number in the figures below is how many copies of each given code point I found in the entire corpus. First, though, to give you a notion of the relative frequencies, here are the top ten trans-ASCII code points in that corpus:

 2663710 U+002013 ‹–›  GC=Pd    EN DASH
 1065594 U+0000A0 ‹ ›  GC=Zs    NO-BREAK SPACE
 1009762 U+0000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN
  784139 U+002212 ‹−›  GC=Sm    MINUS SIGN
  602377 U+002003 ‹ ›  GC=Zs    EM SPACE
  528576 U+0003BC ‹μ›  GC=Ll    GREEK SMALL LETTER MU
  519669 U+0003B2 ‹β›  GC=Ll    GREEK SMALL LETTER BETA
  512312 U+0003B1 ‹α›  GC=Ll    GREEK SMALL LETTER ALPHA
  491842 U+00200A ‹ ›  GC=Zs    HAIR SPACE
  462505 U+0000B0 ‹°›  GC=So    DEGREE SIGN

And here now are the trans-BMP code points, in order of descending frequency:

     544 U+01D49E ‹𝒞›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL C
     450 U+01D4AF ‹𝒯›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL T
     385 U+01D4AE ‹𝒮›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL S
     292 U+01D49F ‹𝒟›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL D
     285 U+01D4B3 ‹𝒳›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL X
     262 U+01D4A9 ‹𝒩›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL N
     258 U+01D4AB ‹𝒫›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL P
     254 U+01D4A2 ‹𝒢›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL G
     185 U+01D49C ‹𝒜›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL A
     178 U+01D53C ‹𝔼›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL E
     137 U+01D4AA ‹𝒪›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL O
      56 U+01D4A5 ‹𝒥›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL J
      48 U+01D4A6 ‹𝒦›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL K
      44 U+01D4B1 ‹𝒱›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL V
      43 U+01D4B2 ‹𝒲›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL W
      42 U+01D4B4 ‹𝒴›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Y
      41 U+01D4B5 ‹𝒵›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Z
      35 U+01D4B0 ‹𝒰›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL U
      30 U+01D4AC ‹𝒬›  GC=Lu    MATHEMATICAL SCRIPT CAPITAL Q
      23 U+01D54A ‹𝕊›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL S
      21 U+01D539 ‹𝔹›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL B
      19 U+01D5A7 ‹𝖧›  GC=Lu    MATHEMATICAL SANS-SERIF CAPITAL H
      18 U+01D517 ‹𝔗›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL T
      15 U+01D4C3 ‹𝓃›  GC=Ll    MATHEMATICAL SCRIPT SMALL N
      14 U+01D535 ‹𝔵›  GC=Ll    MATHEMATICAL FRAKTUR SMALL X
      13 U+01D4BF ‹𝒿›  GC=Ll    MATHEMATICAL SCRIPT SMALL J
      11 U+01D540 ‹𝕀›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL I
       9 U+01D465 ‹𝑥›  GC=Ll    MATHEMATICAL ITALIC SMALL X
       9 U+01D4CE ‹𝓎›  GC=Ll    MATHEMATICAL SCRIPT SMALL Y
       9 U+01D538 ‹𝔸›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL A
       8 U+01D4C2 ‹𝓂›  GC=Ll    MATHEMATICAL SCRIPT SMALL M
       8 U+01D54D ‹𝕍›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL V
       7 U+01D4B6 ‹𝒶›  GC=Ll    MATHEMATICAL SCRIPT SMALL A
       7 U+01D4BE ‹𝒾›  GC=Ll    MATHEMATICAL SCRIPT SMALL I
       7 U+01D4CC ‹𝓌›  GC=Ll    MATHEMATICAL SCRIPT SMALL W
       7 U+01D516 ‹𝔖›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL S
       4 U+01D4CF ‹𝓏›  GC=Ll    MATHEMATICAL SCRIPT SMALL Z
       4 U+01D53B ‹𝔻›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL D
       4 U+01D54B ‹𝕋›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL T
       3 U+01D4BB ‹𝒻›  GC=Ll    MATHEMATICAL SCRIPT SMALL F
       3 U+01D4CA ‹𝓊›  GC=Ll    MATHEMATICAL SCRIPT SMALL U
       3 U+01D507 ‹𝔇›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL D
       3 U+01D542 ‹𝕂›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL K
       3 U+01D546 ‹𝕆›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL O
       2 U+01D4BD ‹𝒽›  GC=Ll    MATHEMATICAL SCRIPT SMALL H
       2 U+01D4C5 ‹𝓅›  GC=Ll    MATHEMATICAL SCRIPT SMALL P
       2 U+01D505 ‹𝔅›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL B
       2 U+01D50E ‹𝔎›  GC=Lu    MATHEMATICAL FRAKTUR CAPITAL K
       2 U+01D541 ‹𝕁›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL J
       2 U+01D543 ‹𝕃›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL L
       2 U+100002 ‹􀀂›  GC=Co    <private use character>
       1 U+01D4B8 ‹𝒸›  GC=Ll    MATHEMATICAL SCRIPT SMALL C
       1 U+01D4C1 ‹𝓁›  GC=Ll    MATHEMATICAL SCRIPT SMALL L
       1 U+01D53D ‹𝔽›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL F
       1 U+01D53E ‹𝔾›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL G
       1 U+01D54C ‹𝕌›  GC=Lu    MATHEMATICAL DOUBLE-STRUCK CAPITAL U
       1 U+01D6A4 ‹𝚤›  GC=Ll    MATHEMATICAL ITALIC SMALL DOTLESS I
       1 U+01D7D9 ‹𝟙›  GC=Nd    MATHEMATICAL DOUBLE-STRUCK DIGIT ONE

I really wish I knew what they were using U+100002 to do. :(
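
For anyone who wants to run a similar tally on their own text, here is a rough sketch in Python 3 using the standard `unicodedata` module. It is not the script that produced the figures above; the function names `count_astral`, `format_row`, and `report` and the sample string are made up for illustration, and the row layout only imitates the listing.

```python
import unicodedata
from collections import Counter

def count_astral(text):
    """Count occurrences of each code point above U+FFFF (outside the BMP)."""
    return Counter(ch for ch in text if ord(ch) > 0xFFFF)

def format_row(ch, count):
    """Format one row: count, code point, character, general category, name."""
    gc = unicodedata.category(ch)
    # Private-use and unassigned code points have no name, so supply a default.
    name = unicodedata.name(ch, "<no name>")
    return f"{count:8d} U+{ord(ch):06X} \u2039{ch}\u203a  GC={gc}    {name}"

def report(text):
    """Return formatted rows sorted by descending count, then by code point."""
    counts = count_astral(text)
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0])))
    return [format_row(ch, n) for ch, n in rows]

# Hypothetical sample: two script capital C, one script capital T,
# one private-use code point, plus some BMP characters that are ignored.
sample = "Let \U0001D49E and \U0001D49E map to \U0001D4AF; \u03b1 \u2013 \u03b2 \U00100002"
for line in report(sample):
    print(line)
```

On a real corpus you would feed the files through `count_astral` one at a time and add the resulting `Counter` objects together, making sure each file is decoded with its actual encoding first.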

If those aren't showing up in your browser, you should install George Douros’s Symbola font. It also includes all the fun Unicode 6.0.0 code points.


For me, it is the Mathematical Alphanumeric Symbols, which are used for math typesetting with OpenType fonts such as Cambria Math.