
Difference between composite characters and surrogate pairs

In Unicode what is the difference between composite characters and surrogate pairs?

To me they sound like similar things - two characters to represent one character. What differentiates these two concepts?

Asked Mar 01 '14 by Sachin Kainth

3 Answers

Surrogate pairs are a weird wart in Unicode.

Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to about 2²¹ are available, though not all are in use. In the context of Unicode, each number is known as a code point.

However, the Unicode suite as a whole contains more than just this encoding. It also contains technologies to serialize code points. This is essentially just an exercise in serializing unsigned integers. Three subfamilies of technologies are specified: UTF-32, UTF-8, and UTF-16.

UTF-32 simply expresses every code-point as a 32-bit unsigned integer. That's easy. Two variants exist, for big and little endian, respectively. Each 32-bit serialized integer is called the code unit of this format, and this is a fixed-width format (one code point per code unit).

UTF-8 is a clever multi-byte format, in which code points take up anything from one to four 8-bit bytes (the original design allowed up to six, before Unicode's code space was capped at U+10FFFF). This format is very portable, since it has no byte-ordering issues and since it is pretty compact for English, near-English and computer code. The code unit of UTF-8 is one byte, and this is a variable-width format (1–4 code units per code point).
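To get a feel for the variable width, here is a minimal Python sketch (my own addition, not part of the original answer) that prints how many UTF-8 code units each of a few characters takes:

# Each character below needs a different number of UTF-8 code units (bytes).
for ch in ("A", "é", "€", "💩"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Output:
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F4A9 -> 4 byte(s): f0 9f 92 a9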

Finally, there's UTF-16: Initially, people thought Unicode could do with only 2¹⁶ numbers, so this was initially deemed to be a fixed-width format with 16-bit code units. However, it eventually became clear that larger numbers were needed. So UTF-16 is now also a variable-width format, but the way this is achieved is that certain 16-bit code units act as indicators that they are part of a two-unit pair, the surrogate pair. To simplify the detection of those pairs, rather than using some external envelope format as UTF-8 does, the 16-bit values used by the surrogates are deliberately carved out of the Unicode encoding itself - that is, the surrogate values, 0xD800 to 0xDFFF, are reserved and will never be assigned to characters.
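The arithmetic behind a surrogate pair is simple enough to sketch. Here is a minimal Python illustration (my own, not from the answer) of how a code point above U+FFFF is split into the two reserved 16-bit units:

def to_surrogate_pair(code_point):
    # Only code points above the 16-bit range need a surrogate pair.
    assert 0x10000 <= code_point <= 0x10FFFF
    v = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F4A9)])  # ['0xd83d', '0xdca9']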

So, in summary, surrogates are the result of forcing a serialization format for Unicode back into the encoding, and distorting the design of the encoding to accommodate the serialization format. This is perhaps an unfortunate historical accident, which is somewhat pointless and unsightly in retrospect, but it's what we have and what we need to live with.


Composite characters, on the other hand, are something much higher-level: They are visual units ("graphemes") that are composed of multiple Unicode code points. Sometimes people refer to code points themselves as "characters", but that's a little bit misleading, since characters should really be graphemes, and they can consist of several components (e.g. a base letter plus diacritics and modifiers).
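In Python, for example, the standard unicodedata module converts between the composed and decomposed forms (a small sketch, not part of the original answer):

import unicodedata

precomposed = "\u00E9"    # é as a single code point
decomposed = "e\u0301"    # e + combining acute accent

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True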

Answered Oct 17 '22 by Kerrek SB


An example of a composite character is Unicode U+00C9, É. It should display identically to the decomposed pair U+0045 E and U+0301 (the combining acute accent character). This is independent of any byte encoding used to actually store the character; they are just two different ways of representing the same graphical character in Unicode.

A surrogate pair is specific to UTF-16, which uses two 16-bit values to represent a single Unicode code point greater than U+FFFF (which obviously cannot fit in a single 16-bit value). For example (from the Wikipedia article), code point U+1D11E is serialized as the two 16-bit values 0xD834 and 0xDD1E. (The actual byte sequence used to represent them will depend on whether you use the big endian or little endian version of UTF-16.)
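You can verify this example in Python (a quick sketch, not part of the original answer); note how the byte order flips between the two variants:

clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF
print(clef.encode("utf-16-be").hex(" "))   # d8 34 dd 1e
print(clef.encode("utf-16-le").hex(" "))   # 34 d8 1e dd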

Answered Oct 17 '22 by chepner


TL;DR

  • Composite character: e¨ → ë
  • Surrogate pair: 0xD83D + 0xDCA9 → 💩

Long Version

Composite Characters (vs Ready-made)

Take the string "Noël"

It has two representations in Unicode:

  • Noël
  • Noël

You probably can't tell the difference. One is made up of four code points, the other is made up of five:

  • Noël: Noe¨l
  • Noël: Noël

One of them uses a "composite" character, and the other uses a "ready-made" character:

  • e¨: U+0065 Latin Small Letter E U+0308 Combining Diaeresis
  • ë: U+00EB Latin Small Letter E With Diaeresis

In other words:

  • Noël: uses the "composite ë character"
  • Noël: uses the "ready-made ë character"

It's important to note that these strings are canonically equivalent. Both these strings represent the same word, and specifically the same character. Except that one happens to use a "ready-made" character.

Not every character has a "ready-made" equivalent. For example:

  • q̊: U+0071 Latin Small Letter Q U+030A Combining Ring Above

That is a small Latin q with a ring above. There is no ready-made version; you have to use the combining diacritic. If there were a ready-made version, that would just mean there are two different representations of the same character.

So that's a "composite character": it's the opposite of a "ready-made character".
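Here is a small Python sketch (my own addition) showing that the two spellings of "Noël" differ as code-point sequences but compare equal after normalization:

import unicodedata

ready_made = "No\u00EBl"    # 4 code points: ready-made ë
composite = "Noe\u0308l"    # 5 code points: e + combining diaeresis

print(len(ready_made), len(composite))                        # 4 5
print(ready_made == composite)                                # False
print(unicodedata.normalize("NFC", composite) == ready_made)  # True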

Surrogate Pairs

Let's look at Noël again (the one using the ready-made character). It consists of four code points:

  • Noël
  • U+004E U+006F U+00EB U+006C

It is four numbers:

UInt32[] text = [0x0000004E, 0x0000006F, 0x000000EB, 0x0000006C];

Those numbers all happen to fit in 16 bits, so a lot of people might be tempted to use an array of UInt16 instead:

UInt16[] text = [0x004E, 0x006F, 0x00EB, 0x006C];

The problem is that not every Unicode character fits in 16 bits. Code points run up to U+10FFFF, so you need a full 32-bit integer to hold an arbitrary one.

Take for example:

  • U+1F449 U+1F351 U+1F44D
  • 👉🍑👍

For this we need the full 32-bits to represent each character:

UInt32[] text = [0x0001F449, 0x0001F351, 0x0001F44D];

And that is all well and good and functional.
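As a quick check in Python (not part of the original answer), ord() recovers exactly those values, none of which fit in 16 bits:

for ch in "👉🍑👍":
    print(f"U+{ord(ch):06X}")   # U+01F449, U+01F351, U+01F44D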

But people hate 32-bit numbers

People feel that using a full 32 bits to represent every character is a waste. And since the entire world basically speaks English anyway, isn't there a way we can just mostly use 16 bits instead?

Enter UTF-16

People came up with a clever way to try to stuff 32-bit numbers into a 16-bit array.

Let's look at U+1F4A9 (💩) and its various encodings:

  • UInt32[] poop32 = [0x0001F4A9]; //UTF-32
  • UInt16[] poop16 = [0xD83D, 0xDCA9]; //UTF-16
  • UInt8[] poop8 = [0xF0, 0x9F, 0x92, 0xA9]; //UTF-8

You see that in UTF-16, in order to represent one character, you need two code units:

  • 0xD83D + 0xDCA9 → 💩

Those two values have to go together. They are a pair - a surrogate pair. If you omit the 2nd UInt16, you are left with something that is invalid:

  • 0xD83D → invalid!
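You can verify all three encodings, and the invalid lone surrogate, in Python (a sketch of my own, not from the answer):

poop = "\U0001F4A9"
print(poop.encode("utf-32-be").hex(" "))   # 00 01 f4 a9
print(poop.encode("utf-16-be").hex(" "))   # d8 3d dc a9
print(poop.encode("utf-8").hex(" "))       # f0 9f 92 a9

# A lone high surrogate cannot be decoded:
try:
    b"\xd8\x3d".decode("utf-16-be")
except UnicodeDecodeError as e:
    print("invalid:", e.reason)   # e.g. "unexpected end of data"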
Answered Oct 17 '22 by Ian Boyd