Java - what are characters, code points and surrogates? What difference is there between them?

Tags: java, character-encoding, character

I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

I've found some information about the differences between characters and code points: characters are what is displayed to human users, and code points are the values that encode those characters. But I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

asked Jun 01 '14 12:06

Alium Britt

1 Answer

To represent text in computers, you have to solve two problems: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, which is far more than 2¹⁶ (65536).
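To make "a number that identifies a symbol" concrete, here is a minimal Java sketch. The class and method names (`CodePointNumbers`, `firstCodePoint`, `describe`) are made up for illustration; only `String.codePointAt`, `String.format` and `Character.MAX_CODE_POINT` come from the JDK.

```java
// Illustrative sketch: a code point is simply an integer assigned to a symbol.
// Class and method names are hypothetical, not part of any library.
class CodePointNumbers {

    // Returns the code point of the first symbol in the string.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    // Formats a code point in the conventional U+XXXX notation.
    static String describe(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // 'A' has the same number in ASCII and in Unicode.
        System.out.println(describe(firstCodePoint("A")));      // U+0041
        // The euro sign, written as an escape to avoid source-encoding issues.
        System.out.println(describe(firstCodePoint("\u20AC"))); // U+20AC
        // The largest code point Unicode allows does not fit in 16 bits.
        System.out.println(Character.MAX_CODE_POINT > 0xFFFF);  // true
    }
}
```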

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
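As a rough illustration of the second step (numbers to bytes), the sketch below encodes the same text with two of the Unicode encodings and compares the byte counts. The class and method names are hypothetical. UTF-32 is left out only because `StandardCharsets` does not guarantee a constant for it.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: the same symbols take a different number of bytes
// depending on the encoding. Names here are hypothetical.
class EncodingLengths {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";              // U+0041
        String euro = "\u20AC";          // U+20AC, euro sign
        String emoji = "\uD83D\uDE00";   // U+1F600, outside the 16-bit range

        System.out.println(utf8Length(ascii) + " / " + utf16Length(ascii)); // 1 / 2
        System.out.println(utf8Length(euro) + " / " + utf16Length(euro));   // 3 / 2
        System.out.println(utf8Length(emoji) + " / " + utf16Length(emoji)); // 4 / 4
    }
}
```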

When you use an encoding whose units have fewer bits than are needed to represent all possible values (such as UTF-16, which uses 16-bit units), you need some workaround.

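Thus, surrogates are 16-bit values that are used in pairs to represent symbols that do not fit into a single two-byte value: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) together encode one code point above 0xFFFF.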
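The workaround can be sketched by hand. The code below follows the UTF-16 arithmetic for code points above 0xFFFF and checks it against the JDK's own `Character.toChars` and `Character.toCodePoint`. The class `SurrogatePairs` and its two methods are hypothetical helpers written for this example.

```java
// Illustrative sketch of how UTF-16 splits a large code point into two
// 16-bit surrogates. Class and method names are hypothetical.
class SurrogatePairs {

    // Splits a code point in the range 0x10000..0x10FFFF into {high, low}.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombines a high and a low surrogate into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600;
        char[] manual = toSurrogates(codePoint);
        char[] jdk = Character.toChars(codePoint);

        // The hand-rolled arithmetic and the JDK agree.
        System.out.println(manual[0] == jdk[0] && manual[1] == jdk[1]); // true
        System.out.println(Integer.toHexString(manual[0]));             // d83d
        System.out.println(Integer.toHexString(manual[1]));             // de00
        System.out.println(fromSurrogates(manual[0], manual[1])
                == Character.toCodePoint(jdk[0], jdk[1]));              // true
    }
}
```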

Java uses UTF-16 internally to represent text.

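In particular, a char (character) is an unsigned two-byte value that holds one UTF-16 code unit. A symbol outside the 16-bit range therefore takes two chars (a surrogate pair), which is why stepping through a String char by char can give you surrogates rather than whole code points.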
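A small sketch of what this means in practice when stepping through a string. The class and method names are hypothetical. `String.codePoints()` requires Java 8 or later.

```java
// Illustrative sketch: iterating by char versus iterating by code point.
// Class and method names are hypothetical.
class CharsVersusCodePoints {

    // Number of UTF-16 code units (chars) in the string.
    static int countChars(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // The code points of the string, surrogate pairs already combined.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (stored as a surrogate pair), then 'b'.
        String s = "a\uD83D\uDE00b";

        System.out.println(countChars(s));       // 4
        System.out.println(countCodePoints(s));  // 3

        // Stepping by char exposes the two surrogate halves separately.
        for (char c : s.toCharArray()) {
            System.out.println(Integer.toHexString(c)
                    + " surrogate=" + Character.isSurrogate(c));
        }

        // Stepping by code point yields whole symbols.
        for (int cp : codePointsOf(s)) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

If a loop only needs to compare or copy text, iterating by char is usually fine. The code point view matters when each symbol is inspected or counted individually.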

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

answered Oct 04 '22 10:10

Cephalopod
