Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a surrogate pair?

I came across this code in a javascript open source project.

validator.isLength = function (str, min, max) 
    // match surrogate pairs in string or declare an empty array if none found in string
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // subtract the surrogate pairs string length from main string length
    var len = str.length - surrogatePairs.length;
    // now compare string length with min and max ... also make sure max is defined(in other words, max param is optional for function)
    return len >= min && (typeof max === 'undefined' || len <= max);
};

As far as I understand, the above code is checking the length of the string but not taking the surrogate pairs into account. So:

  1. Is my understanding of the code correct?

  2. What are surrogate pairs?

I have thus far only figured out that this is related to encoding.

like image 956
Noman Ur Rehman Avatar asked Aug 13 '15 11:08

Noman Ur Rehman


People also ask

What are surrogate pairs in Unicode?

With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.

What is a surrogate character?

These characters have some special values; they are made up of two Unicode characters in two specific ranges such that the first Unicode character is in one range (for example 0xD800-0xD8FF) and the second Unicode character is in the second range (for example 0xDC00-0xDCFF). This is called a surrogate pair.

What is surrogate pairs Javascript?

Surrogate pair is a representation for a single abstract character that consists of a sequence of code units of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit. An astral code point requires two code units β€” a surrogate pair.

What is XML surrogate pair?

A surrogate pair consists of a high surrogate and a low surrogate; having two high surrogates next to each other is not a "surrogate pair", it is nonsense. Also note that the correct way to represent a non-BMP character in XML is as a single character reference for the combined character, for example &#x120AB; .


1 Answers

  1. Yes. Your understanding is correct. The function returns the length of the string in Unicode Code Points.

  2. JavaScript is using UTF-16 to encode its strings. This means two bytes (16-bit) are used to represent one Unicode Code Point.

    Now there are characters (like the Emojis) in Unicode that have a that high code point so that they cannot be stored in 2 bytes (16bit) so they need to get encoded into two UTF-16 characters (4 bytes). These are called surrogate pairs.

Try this

var len = "πŸ˜€".length // There is an emoji in the string (if you don’t see it)

vs

var str = "πŸ˜€"
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;

In the first example len will be 2 because the Emoji consists of two 2 UTF-16 characters. In the second example len will be 1.

You might want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

like image 117
idmean Avatar answered Oct 12 '22 16:10

idmean