What are surrogate characters in UTF-8?

I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp function preg_match fails during these comparisons and says that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned these are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places but still don't understand what they are.

211

asked Jun 23 '18 12:06

Gherman

1 Answer

What are surrogate characters in UTF-8?

This is almost like a trick question.

Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).

Approximate answer #2: Invalid (if not paired).

Approximate answer #3: It's not UTF-8; it's Modified UTF-8.

Synopsis: The term doesn't apply to UTF-8.

Unicode codepoints have a range that needs 21 bits of data.

UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as a pair of code units, the first from a "high" range (D800-DBFF), the second from a "low" range (DC00-DFFF). Unicode reserves the codepoints in those two ranges so that they are never assigned to characters. They are called surrogates, but they are not characters. They don't mean anything by themselves.
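The pairing arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation; the function names to_surrogate_pair and from_surrogate_pair are made up for this example.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Neither D83D nor DE00 identifies a character on its own; only the pair maps back to U+1F600.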

UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.

#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units are exactly the ones that UTF-8 encodes with four 8-bit code units, and vice versa.
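A quick way to check this correspondence is to count code units for a few sample codepoints. The sketch below uses Python's built-in codecs; the helper name code_unit_counts and the sample list are just for illustration.

```python
# One sample from each UTF-8 length class: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def code_unit_counts(ch):
    """Return (number of UTF-8 code units, number of UTF-16 code units)."""
    utf8_units = len(ch.encode("utf-8"))            # 8-bit units
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 16-bit units
    return utf8_units, utf16_units


for ch in samples:
    utf8_units, utf16_units = code_unit_counts(ch)
    print(f"U+{ord(ch):04X}: {utf8_units} UTF-8 unit(s), {utf16_units} UTF-16 unit(s)")
```

Only the last sample, which lies above U+FFFF, needs four UTF-8 code units and two UTF-16 code units.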

#2 You can mechanically apply the UTF-8 encoding algorithm to the surrogate codepoints, but the result is not valid UTF-8. The bytes can't be decoded to a valid codepoint. A compliant reader would throw an exception or throw out the bytes and insert a replacement character (�).
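Python's UTF-8 codec happens to show both behaviours, so it makes a convenient illustration. In this sketch the "surrogatepass" error handler is used only to force the invalid bytes into existence; the variable names are arbitrary.

```python
lone = "\ud83d"  # a high surrogate with no low surrogate after it

# A strict encoder refuses to encode a surrogate codepoint.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Forcing the UTF-8 algorithm onto the surrogate yields 3 bytes: ED A0 BD.
forced = lone.encode("utf-8", "surrogatepass")

# A strict decoder rejects those bytes...
try:
    forced.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# ...and a lenient one substitutes U+FFFD replacement characters.
replaced = forced.decode("utf-8", errors="replace")

print(strict_encode_ok, forced.hex(), strict_decode_ok)
```

This is probably the same kind of rejection the question describes: a UTF-8-aware function refusing to handle the DC00-DFFF range.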

#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are passed as Modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints, so a surrogate pair becomes two 3-byte sequences instead of one 4-byte sequence.
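To make the difference concrete, here is a rough Python imitation of that idea. It is a sketch, not Java's implementation: the function name modified_utf8_sketch is invented for this example, and it also writes U+0000 as the two bytes C0 80, which is the other way Modified UTF-8 differs from standard UTF-8.

```python
import struct


def modified_utf8_sketch(text):
    """Apply the UTF-8 algorithm to each UTF-16 code unit of text (illustrative only)."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    out = bytearray()
    for unit in units:
        if unit == 0:
            out += b"\xc0\x80"  # NUL is written as an overlong 2-byte form
        else:
            # "surrogatepass" lets surrogate code units through the encoder
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
print(emoji.encode("utf-8").hex())        # f09f9880 (standard UTF-8, 4 bytes)
print(modified_utf8_sketch(emoji).hex())  # eda0bdedb880 (two 3-byte sequences)
```

The six bytes in the second line are what a strict UTF-8 reader rejects, which is why the two encodings must not be confused.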

Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.
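A small sketch of what goes wrong when that rule is broken, using Latin-1 as an arbitrarily chosen "wrong" encoding:

```python
original = "\u00e9"  # é

# Written as UTF-8, read back as Latin-1: one character turns into two.
written = original.encode("utf-8")       # bytes C3 A9
misread = written.decode("latin-1")      # "Ã©"

# Written as Latin-1, read back as UTF-8: the byte is not valid UTF-8,
# so a lenient reader can only substitute U+FFFD and the character is lost.
lost = original.encode("latin-1").decode("utf-8", errors="replace")

# Reading with the encoding that was used to write round-trips cleanly.
reread = written.decode("utf-8")

print(len(misread), lost == "\ufffd", reread == original)
```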

answered Sep 28 '22 10:09

Tom Blodget

Tags:

utf-8

utf

surrogate-pairs
