Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java's and C#'s implementation of String

In the Java and C# implementation of String, is the underlying information a null-terminated char array like in C/C++?

(In addition to other information like size, etc.)

like image 694
Daniel Avatar asked Sep 08 '11 18:09

Daniel


3 Answers

No. It is a sequence of UTF-16 code units and a length. Java and C# strings can contain embedded NULs.

Each UTF-16 code-unit occupies two bytes, so you can think of the string "\n\0\n" as:

{
  length: 3,  // 3 pairs of bytes == 3 UTF-16 code units
  bytes:  [0, 10, // \n
           0, 0,  // \0
           0, 10] // \n
}

Note that the last byte in bytes is not 0. The length field tells how many of the bytes are used. This allows substring to be very efficient -- reuse the same byte array, but with a different length (and offset if your VM implementation can't point into an array).

UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.

From javadoc

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

C# System.String is defined similarly

Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char. The resulting collection of Char objects constitutes the String.

I'm not sure whether C# guards against orphaned surrogates, but the above text seems to mix the terms "scalar value" and "codepoint" which is confusing. A scalar value is defined thus by unicode.org:

Any Unicode code point except high-surrogate and low-surrogate code points

Java definitely takes the codepoint view, and does not attempt to guard against invalid scalar values in strings.

"Strings Immutability and Persistence" explains the efficiency benefits of this representation.

One of the benefits of the immutable data types I've talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.

EDIT: The above is true conceptually and in practice, but VMs and CLRs have freedom to do things differently in certain situations.

The Java language specification mandates that strings are laid out a certain way in .class files, and its JNI jstring type abstracts away in-memory representation details so a VM could, in theory, represent a string in memory as a NUL-terminated UTF-8 string with a two-byte form used for embedded NUL characters instead of the int32 length and uint16[] bytes representation that allows for efficient random access to code-units.

VMs don't do this in practice though. "The Most Expensive One-byte Mistake" argues that NUL-terminated strings were a huge mistake in C, so I doubt VMs will adopt them internally for efficiency reasons.

The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

...

Thinking a bit about virtual memory systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.

One example is FreeBSD's libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of "unsigned long," typically 32 or 64 bits, and then "mop up any trailing bytes" as the comment describes it, with byte-wide operations.2

If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a [virtual memory] page and the next [virtual memory] page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

like image 88
Mike Samuel Avatar answered Oct 08 '22 11:10

Mike Samuel


As an implementation detail, a string in the Microsoft implementation of the CLR is laid out in memory pretty much the same as a BSTR was in COM. (See http://blogs.msdn.com/b/ericlippert/archive/2003/09/12/52976.aspx for details on BSTRs.)

That is, a string is laid out as four bytes containing the length, followed by that many two-byte UTF-16 characters, followed by two bytes of zero.

Of course it is not necessary to end a length-prefixed string with a zero character, but it is certainly convenient to do so, particularly when you consider the scenarios where you have to interoperate between C# programs and unmanaged C++ or VB6 programs. The marshaller can sometimes save on some copying because it knows that the string is already in a null-terminated format.

As I said, that's an implementation detail; you shouldn't rely upon it.

I don't know what Java does.

like image 23
Eric Lippert Avatar answered Oct 08 '22 10:10

Eric Lippert


I can't speak for C#, but Java's String source says no. Size information of the array is stored in the array, giving you no need for a null termination.

public final class String implements java.io.Serializable, Comparable<String>, CharSequence
{
    /** The value is used for character storage. */
    private final char value[];

    /** The offset is the first index of the storage that is used. */
    private final int offset;

    /** The count is the number of characters in the String. */
    private final int count;

    /** Cache the hash code for the string */
    private int hash; // Default to 0

    // ... rest of class
}
like image 1
corsiKa Avatar answered Oct 08 '22 11:10

corsiKa