
Why is the size of `char` 4 bytes in Rust?

Tags: rust

This code shows that char takes 4 bytes:

println!("char : {}", std::mem::size_of::<char>());

  1. Why does it take 4 bytes?
  2. Does the size depend on the platform, or is it always 4 bytes?
  3. If it's always 4 bytes, is it for something special?
  4. Does the compiler guarantee some minimum size for char?

On https://play.rust-lang.org/ I also get 4 bytes.

Angel Angel asked Apr 03 '16 02:04


3 Answers

First of all: a char in Rust is a unique integral value representing a Unicode Scalar value. For example, 💩 (aka Pile of Poo, aka U+1F4A9) is represented in Rust by a char with a value of 128169 in decimal (0x1F4A9 in hexadecimal):

fn main() {
    let c: char = "💩".chars().next().unwrap();
    println!("💩 is {} ({})", c, c as u32);
}


With that said, the Rust char is 4 bytes because 4 is the smallest power-of-two number of bytes that can hold the integral value of any Unicode Scalar value: the largest one, U+10FFFF, needs 21 bits, which does not fit in 2 bytes. The decision was driven by the domain, not by architectural constraints.
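The arithmetic behind that claim can be checked with a minimal sketch. The helper `bits_needed` is a hypothetical name introduced here for illustration; `char::MAX` and `size_of` come from the standard library.

```rust
use std::mem::size_of;

/// Number of bits needed to store `value`
/// (hypothetical helper, not part of the standard library).
fn bits_needed(value: u32) -> u32 {
    u32::BITS - value.leading_zeros()
}

fn main() {
    // char::MAX is the largest Unicode scalar value, U+10FFFF.
    let max = char::MAX as u32;
    println!("largest scalar value: {:#X}", max);
    println!("bits needed: {}", bits_needed(max));
    println!("size_of::<char>(): {} bytes", size_of::<char>());
}
```

Since 21 bits is more than 16 and fewer than 32, 2 bytes are too small and 4 bytes are the next power of two.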


Note: the emphasis on Scalar value matters because many "characters" as we perceive them are actually graphemes composed of several Unicode scalar values (a base character plus combining characters); in that case multiple chars are required.
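As a small illustration of that note, the same visible "é" can be written either as one scalar value or as a base letter followed by a combining accent. The helper `scalar_count` is a hypothetical name used only for this sketch.

```rust
/// Counts the chars (Unicode scalar values) in a string slice
/// (hypothetical helper for illustration).
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{e9}"; // U+00E9, "é" as a single scalar value
    let combined = "e\u{301}"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

    println!("{} -> {} char(s)", precomposed, scalar_count(precomposed));
    println!("{} -> {} char(s)", combined, scalar_count(combined));
}
```

Both strings usually render as the same grapheme, yet the second one needs two chars, so a single char cannot hold it.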

Matthieu M. answered Oct 21 '22 18:10


char is four bytes. It is always four bytes, it will always be four bytes. Four bytes it be, and four bytes shall it remain.

It's not for anything special; four bytes is simply the smallest power of two in which you can store any Unicode scalar value. Various other languages do the same thing.
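One way to see this for yourself is a sketch like the following, which assumes a reasonably recent compiler (`assert!` inside a `const` item needs Rust 1.57 or later). The helper `is_scalar_value` is a hypothetical name; `char::from_u32` is the standard library conversion.

```rust
use std::mem::{align_of, size_of};

// Compile-time check: the build fails if char is ever not 4 bytes.
const _: () = assert!(size_of::<char>() == 4);

/// Returns true if `value` is a valid Unicode scalar value, i.e. a valid char
/// (hypothetical helper for illustration).
fn is_scalar_value(value: u32) -> bool {
    char::from_u32(value).is_some()
}

fn main() {
    println!("char: size {}, align {}", size_of::<char>(), align_of::<char>());
    println!("u32 : size {}, align {}", size_of::<u32>(), align_of::<u32>());

    // Not every 32-bit pattern is a char: surrogates and values
    // above U+10FFFF are rejected.
    println!("0x1F4A9  valid? {}", is_scalar_value(0x1F4A9));
    println!("0xD800   valid? {}", is_scalar_value(0xD800));
    println!("0x110000 valid? {}", is_scalar_value(0x110000));
}
```

So a char occupies the same space as a u32, but only the scalar-value subset of the u32 range is ever stored in it.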

DK. answered Oct 21 '22 17:10


char is four bytes; it doesn't depend on the architecture.

Why? According to Wikipedia's article on UTF-8:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use. Four bytes are needed for characters in the other planes of Unicode.

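So if you want to represent any possible Unicode character in a fixed-size type, the compiler must reserve 4 bytes. Note that a char holds the scalar value itself rather than its UTF-8 encoding, but the bound is the same: values up to U+10FFFF need 21 bits.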
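The byte counts quoted from Wikipedia can be observed with a short sketch. The helper `utf8_lengths` is a hypothetical name for illustration; `char::len_utf8` is the standard library method that reports the encoded length.

```rust
/// UTF-8 encoded length, in bytes, of each char in `s`
/// (hypothetical helper for illustration).
fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}

fn main() {
    // 'a' (ASCII), U+00E9 (two bytes), U+20AC (rest of the BMP), U+1F4A9 (other planes).
    let text = "a\u{e9}\u{20ac}\u{1f4a9}";

    for (c, len) in text.chars().zip(utf8_lengths(text)) {
        println!("{:?} takes {} byte(s) in UTF-8", c, len);
    }

    // A String or &str stores the variable-width UTF-8 bytes...
    println!("str length in bytes: {}", text.len());
    // ...while each char value is fixed-width.
    println!("size_of::<char>(): {}", std::mem::size_of::<char>());
}
```

The variable widths apply to str and String storage; a standalone char is always the fixed 4 bytes.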

You should also consider Byte Alignment: http://www.eventhelix.com/realtimemantra/ByteAlignmentAndOrdering.htm

Fylux answered Oct 21 '22 18:10