
Emacs 23 uses a character set four times larger than Unicode - why?

From Emacs 23.1 NEWS:

*** The Emacs character set is now a superset of Unicode. (It has about four times the code space, which should be plenty).

And more details later on:

*** In multibyte buffers and strings, characters are represented by UTF-8 byte sequences. The character code space is now 0x0..0x3FFFFF with no gap; code points 0x0..0x10FFFF are Unicode characters of the same code points, while code points 0x3FFF80..0x3FFFFF are raw 8-bit bytes.

According to Wikipedia, the BMP of the UCS has 65536 code points, the latest version of Unicode contains more than 107000 characters, and the UCS has more than one million code points. 0x3FFFFF is 4,194,303, so the Emacs code space holds more than four million code points.
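The arithmetic behind "about four times" can be checked directly from the ranges quoted in the NEWS entry. The following is only a sketch; the constant names are made up for illustration and do not come from the Emacs sources.

```python
# Ranges as quoted in the Emacs 23.1 NEWS entry (names are illustrative).
UNICODE_MAX = 0x10FFFF    # last Unicode code point
EMACS_MAX = 0x3FFFFF      # last code point in the Emacs code space
RAW_BYTE_MIN = 0x3FFF80   # first code point reserved for raw 8-bit bytes

unicode_code_points = UNICODE_MAX + 1
emacs_code_points = EMACS_MAX + 1
raw_byte_slots = EMACS_MAX - RAW_BYTE_MIN + 1
ratio = emacs_code_points / unicode_code_points

print(f"Unicode code points: {unicode_code_points}")
print(f"Emacs code points:   {emacs_code_points}")
print(f"Raw-byte slots:      {raw_byte_slots}")
print(f"Ratio:               {ratio:.2f}")
```

The ratio works out to 64/17, a little under 3.8, which matches the "about four times" wording. The 128 slots at the top of the range are enough for one code point per byte value from 0x80 to 0xFF.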

What problems does this solve, or how else is it beneficial to have an internal character set that is a superset of Unicode?

Laurynas Biveinis asked Nov 29 '22 20:11



1 Answer

Unicode is designed to encompass the character sets required for all human languages, which is certainly useful for globalisation/localisation of your code. But because Emacs is the tool of the gods themselves, it also has to encompass every character that may be used by deities of all kinds (including but not limited to the eldritch runes of the Great Old Ones), spacefaring races (including but not limited to our future alien overlords), ultra-intelligent machine intelligences (including but not limited to our future robot masters) and every other being that desires infinite cosmic power. That is potentially a whole lot of characters!

Or it could be that UTF-8, as an encoding scheme, has much more room than the Unicode set takes up, and Emacs simply supports more of that space, but I prefer my explanation above.
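To put rough numbers on the UTF-8 explanation, here is a small sketch of how many payload bits a UTF-8-style sequence of a given length can carry. It assumes the generic bit layout of the original UTF-8 design, which allowed sequences of up to 6 bytes before the standard was restricted to 4 bytes and 0x10FFFF. The helper names are hypothetical, and nothing here models how Emacs actually stores characters or raw bytes internally.

```python
def max_code_point(n_bytes: int) -> int:
    """Largest value an n-byte UTF-8-style sequence can carry (1 <= n <= 6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("UTF-8-style sequences are 1 to 6 bytes long")
    if n_bytes == 1:
        return 0x7F  # 0xxxxxxx: 7 payload bits
    # The lead byte carries (7 - n_bytes) payload bits,
    # and each continuation byte (10xxxxxx) carries 6.
    payload_bits = (7 - n_bytes) + 6 * (n_bytes - 1)
    return (1 << payload_bits) - 1


def bytes_needed(code_point: int) -> int:
    """Shortest sequence length that can hold code_point."""
    for n in range(1, 7):
        if code_point <= max_code_point(n):
            return n
    raise ValueError("code point does not fit in 6 bytes")


for n in range(1, 7):
    print(f"{n} byte(s): up to {max_code_point(n):#x}")

print("Unicode maximum 0x10FFFF needs", bytes_needed(0x10FFFF), "bytes")
print("Emacs maximum 0x3FFFFF needs", bytes_needed(0x3FFFFF), "bytes")
```

Under this layout a 4-byte sequence tops out at 0x1FFFFF, roughly twice the Unicode range, so reaching 0x3FFFFF would take a 5-byte form. That is consistent with the idea that the encoding scheme has spare room beyond Unicode, though it does not by itself explain why Emacs chose 0x3FFFFF as the limit.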

glenatron answered Dec 04 '22 08:12
