Non-BMP characters are represented by an ordered pair (called a surrogate pair in Unicode vocabulary) of two 16-bit code units. Even though a non-BMP character reads as a single character to a human, JavaScript stores it internally as two code units and counts it as two characters.
Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuation, and many other characters used in writing text.
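As a rough illustration of how a code point above U+FFFF is split into its two 16-bit halves, here is a minimal Python sketch. The helper names `to_surrogate_pair` and `utf16_length` are made up for this example; the arithmetic is the standard UTF-16 encoding rule, and `utf16_length` mimics what JavaScript's `.length` reports by counting UTF-16 code units.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return high, low


def utf16_length(text: str) -> int:
    """Count 16-bit code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


high, low = to_surrogate_pair(0x1F602)   # U+1F602 FACE WITH TEARS OF JOY
print(f"{high:04X} {low:04X}")           # D83D DE02
print(utf16_length("\U0001F602"))        # 2 code units
print(len("\U0001F602"))                 # 1, because Python counts code points
```

The last two lines show the mismatch the paragraph describes: one visible character, two stored units.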
The standard, which is maintained by the Unicode Consortium, defines 144,697 characters covering 159 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes.
The first 128 characters of Unicode are the same as the ASCII character set. The first 32 characters, U+0000 to U+001F (0-31), are called control codes. They are inherited from the past, and most of them are now obsolete. They were used for teletype machines, something that existed before the fax.
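Both claims can be checked with Python's standard library. This is a small sketch, not a definitive test: it decodes the 128 byte values as ASCII and as UTF-8 and compares the results, then uses `unicodedata.category` to list the code points in that range whose general category is `Cc` (control). The variable names are arbitrary.

```python
import unicodedata

# Bytes 0x00..0x7F decode to the same characters as ASCII and as UTF-8.
ascii_bytes = bytes(range(128))
same_as_ascii = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same_as_ascii)                      # True

# Code points in the ASCII range whose general category is Cc (control).
control_codes = [cp for cp in range(128)
                 if unicodedata.category(chr(cp)) == "Cc"]
print(control_codes[:32] == list(range(32)))   # True: U+0000..U+001F
print(f"U+{control_codes[-1]:04X}")            # U+007F (DELETE is also Cc)
```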
Emoji are now the most common non-BMP characters by far. 😂, otherwise known as U+1F602 FACE WITH TEARS OF JOY, is the most common one on Twitter's public stream. It occurs more frequently than the tilde!
Excellent question!
The answer is the mathematical letters. This past December I did a scan of the entire PubMed Open Access corpus, and came up with these figures for astral characters in it.
The first number in each row below is how many occurrences of the given code point I found in the entire corpus. First, though, to give you a notion of the relative frequencies, here are the top ten trans-ASCII code points in that corpus:
2663710 U+002013 ‹–› GC=Pd EN DASH
1065594 U+0000A0 ‹ › GC=Zs NO-BREAK SPACE
1009762 U+0000B1 ‹±› GC=Sm PLUS-MINUS SIGN
784139 U+002212 ‹−› GC=Sm MINUS SIGN
602377 U+002003 ‹ › GC=Zs EM SPACE
528576 U+0003BC ‹μ› GC=Ll GREEK SMALL LETTER MU
519669 U+0003B2 ‹β› GC=Ll GREEK SMALL LETTER BETA
512312 U+0003B1 ‹α› GC=Ll GREEK SMALL LETTER ALPHA
491842 U+00200A ‹ › GC=Zs HAIR SPACE
462505 U+0000B0 ‹°› GC=So DEGREE SIGN
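Rows in this format could be produced by a tally along the following lines. This is only a sketch under assumptions: the original scan's tooling is not described, so the function names (`tally_non_ascii`, `format_row`, `report`), the `astral_only` flag, and the `<unnamed>` fallback for code points without a character name are all invented here, and the sample string stands in for a real corpus.

```python
import unicodedata
from collections import Counter
from typing import List


def tally_non_ascii(text: str) -> Counter:
    """Count every code point above U+007F (the 'trans-ASCII' ones)."""
    return Counter(ch for ch in text if ord(ch) > 0x7F)


def format_row(ch: str, count: int) -> str:
    """Render one row: count, code point, glyph, general category, name."""
    name = unicodedata.name(ch, "<unnamed>")   # fallback for unnamed code points
    category = unicodedata.category(ch)
    return f"{count} U+{ord(ch):06X} ‹{ch}› GC={category} {name}"


def report(text: str, astral_only: bool = False) -> List[str]:
    """Rows in descending frequency; astral_only keeps code points above U+FFFF."""
    counts = tally_non_ascii(text)
    return [format_row(ch, n) for ch, n in counts.most_common()
            if not astral_only or ord(ch) > 0xFFFF]


# Hypothetical sample text: two en dashes, one plus-minus, one script capital C.
sample = "5\u20137 \u00b1 2, pages 3\u20134, category \U0001D49E"
for row in report(sample):
    print(row)
```

Calling `report(sample, astral_only=True)` would keep only the trans-BMP rows, which is the split used for the second table.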
And here now are the trans-BMP code points, in order of descending frequency:
544 U+01D49E ‹𝒞› GC=Lu MATHEMATICAL SCRIPT CAPITAL C
450 U+01D4AF ‹𝒯› GC=Lu MATHEMATICAL SCRIPT CAPITAL T
385 U+01D4AE ‹𝒮› GC=Lu MATHEMATICAL SCRIPT CAPITAL S
292 U+01D49F ‹𝒟› GC=Lu MATHEMATICAL SCRIPT CAPITAL D
285 U+01D4B3 ‹𝒳› GC=Lu MATHEMATICAL SCRIPT CAPITAL X
262 U+01D4A9 ‹𝒩› GC=Lu MATHEMATICAL SCRIPT CAPITAL N
258 U+01D4AB ‹𝒫› GC=Lu MATHEMATICAL SCRIPT CAPITAL P
254 U+01D4A2 ‹𝒢› GC=Lu MATHEMATICAL SCRIPT CAPITAL G
185 U+01D49C ‹𝒜› GC=Lu MATHEMATICAL SCRIPT CAPITAL A
178 U+01D53C ‹𝔼› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL E
137 U+01D4AA ‹𝒪› GC=Lu MATHEMATICAL SCRIPT CAPITAL O
56 U+01D4A5 ‹𝒥› GC=Lu MATHEMATICAL SCRIPT CAPITAL J
48 U+01D4A6 ‹𝒦› GC=Lu MATHEMATICAL SCRIPT CAPITAL K
44 U+01D4B1 ‹𝒱› GC=Lu MATHEMATICAL SCRIPT CAPITAL V
43 U+01D4B2 ‹𝒲› GC=Lu MATHEMATICAL SCRIPT CAPITAL W
42 U+01D4B4 ‹𝒴› GC=Lu MATHEMATICAL SCRIPT CAPITAL Y
41 U+01D4B5 ‹𝒵› GC=Lu MATHEMATICAL SCRIPT CAPITAL Z
35 U+01D4B0 ‹𝒰› GC=Lu MATHEMATICAL SCRIPT CAPITAL U
30 U+01D4AC ‹𝒬› GC=Lu MATHEMATICAL SCRIPT CAPITAL Q
23 U+01D54A ‹𝕊› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL S
21 U+01D539 ‹𝔹› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL B
19 U+01D5A7 ‹𝖧› GC=Lu MATHEMATICAL SANS-SERIF CAPITAL H
18 U+01D517 ‹𝔗› GC=Lu MATHEMATICAL FRAKTUR CAPITAL T
15 U+01D4C3 ‹𝓃› GC=Ll MATHEMATICAL SCRIPT SMALL N
14 U+01D535 ‹𝔵› GC=Ll MATHEMATICAL FRAKTUR SMALL X
13 U+01D4BF ‹𝒿› GC=Ll MATHEMATICAL SCRIPT SMALL J
11 U+01D540 ‹𝕀› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL I
9 U+01D465 ‹𝑥› GC=Ll MATHEMATICAL ITALIC SMALL X
9 U+01D4CE ‹𝓎› GC=Ll MATHEMATICAL SCRIPT SMALL Y
9 U+01D538 ‹𝔸› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL A
8 U+01D4C2 ‹𝓂› GC=Ll MATHEMATICAL SCRIPT SMALL M
8 U+01D54D ‹𝕍› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL V
7 U+01D4B6 ‹𝒶› GC=Ll MATHEMATICAL SCRIPT SMALL A
7 U+01D4BE ‹𝒾› GC=Ll MATHEMATICAL SCRIPT SMALL I
7 U+01D4CC ‹𝓌› GC=Ll MATHEMATICAL SCRIPT SMALL W
7 U+01D516 ‹𝔖› GC=Lu MATHEMATICAL FRAKTUR CAPITAL S
4 U+01D4CF ‹𝓏› GC=Ll MATHEMATICAL SCRIPT SMALL Z
4 U+01D53B ‹𝔻› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL D
4 U+01D54B ‹𝕋› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL T
3 U+01D4BB ‹𝒻› GC=Ll MATHEMATICAL SCRIPT SMALL F
3 U+01D4CA ‹𝓊› GC=Ll MATHEMATICAL SCRIPT SMALL U
3 U+01D507 ‹𝔇› GC=Lu MATHEMATICAL FRAKTUR CAPITAL D
3 U+01D542 ‹𝕂› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL K
3 U+01D546 ‹𝕆› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL O
2 U+01D4BD ‹𝒽› GC=Ll MATHEMATICAL SCRIPT SMALL H
2 U+01D4C5 ‹𝓅› GC=Ll MATHEMATICAL SCRIPT SMALL P
2 U+01D505 ‹𝔅› GC=Lu MATHEMATICAL FRAKTUR CAPITAL B
2 U+01D50E ‹𝔎› GC=Lu MATHEMATICAL FRAKTUR CAPITAL K
2 U+01D541 ‹𝕁› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL J
2 U+01D543 ‹𝕃› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL L
2 U+100002 ‹􀀂› GC=Co <private use character>
1 U+01D4B8 ‹𝒸› GC=Ll MATHEMATICAL SCRIPT SMALL C
1 U+01D4C1 ‹𝓁› GC=Ll MATHEMATICAL SCRIPT SMALL L
1 U+01D53D ‹𝔽› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL F
1 U+01D53E ‹𝔾› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL G
1 U+01D54C ‹𝕌› GC=Lu MATHEMATICAL DOUBLE-STRUCK CAPITAL U
1 U+01D6A4 ‹𝚤› GC=Ll MATHEMATICAL ITALIC SMALL DOTLESS I
1 U+01D7D9 ‹𝟙› GC=Nd MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
I really wish I knew what they were using U+100002 to do. :(
If those aren't showing up in your browser, you should install George Douros's Symbola font. It also has all the fun Unicode 6.0.0 code points in it.
For me, it is the Mathematical Alphanumeric Symbols, which are used for math typesetting with OpenType fonts such as Cambria Math.