This code shows that char takes 4 bytes:
println!("char : {}", std::mem::size_of::<char>());
I also get 4 bytes for char on https://play.rust-lang.org/.
The char type takes 1 byte of memory (8 bits) and allows expressing in the binary notation 2^8=256 values. The char type can contain both positive and negative values. The range of values is from -128 to 127.
A char is a 'Unicode scalar value', which is any 'Unicode code point' other than a surrogate code point. This has a fixed numerical definition: code points are in the range 0 to 0x10FFFF, inclusive. Surrogate code points, used by UTF-16, are in the range 0xD800 to 0xDFFF.
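The definition quoted above can be checked directly. The following is a minimal sketch, assuming the standard library's char::from_u32 is used as the validity test; the helper name is_scalar_value is hypothetical and chosen only for illustration.

```rust
/// Hypothetical helper: returns true if `n` is a Unicode scalar value,
/// i.e. a code point in 0..=0x10FFFF that is not a surrogate
/// (0xD800..=0xDFFF). `char::from_u32` returns `None` otherwise.
fn is_scalar_value(n: u32) -> bool {
    char::from_u32(n).is_some()
}

fn main() {
    // Boundaries taken from the ranges quoted in the documentation.
    for n in [0x41u32, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!("{:#08X}: valid char = {}", n, is_scalar_value(n));
    }
}
```

The surrogate range and everything above 0x10FFFF are rejected, so not every u32 is a valid char even though both types occupy 4 bytes.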
On x86, x86-64, and ARM architectures, char size is 8 bits, which is the same as the smallest integer size.
First of all: a char in Rust is a unique integral value representing a Unicode Scalar value. For example, consider 💩 (aka Pile of Poo, aka U+1F4A9); in Rust it will be represented by a char with a value of 128169 in decimal (that is 0x1F4A9 in hexadecimal):
fn main() {
    let c: char = "💩".chars().next().unwrap();
    println!("💩 is {} ({})", c, c as u32);
}
With that said, the Rust char is 4 bytes because 4 is the smallest power-of-two number of bytes that can hold the integral value of any Unicode Scalar value: the largest one, 0x10FFFF, needs 21 bits, which does not fit in 2 bytes. The decision was driven by the domain, not by architectural constraints.
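The arithmetic behind "smallest power of 2" can be sketched as follows. This is only an illustration of the reasoning, not how the compiler derives the size; the helpers bits_needed and smallest_pow2_bytes are hypothetical names introduced here.

```rust
use std::mem::size_of;

/// Hypothetical helper: number of bits needed to store `n`.
fn bits_needed(n: u32) -> u32 {
    32 - n.leading_zeros()
}

/// Hypothetical helper: smallest power-of-two byte count
/// that can hold `bits` bits.
fn smallest_pow2_bytes(bits: u32) -> u32 {
    let bytes = (bits + 7) / 8; // round up to whole bytes
    bytes.next_power_of_two()
}

fn main() {
    let max = char::MAX as u32; // largest Unicode scalar value, 0x10FFFF
    let bits = bits_needed(max);
    println!("largest scalar value: {:#X}", max);
    println!("bits needed: {}", bits);
    println!("smallest power-of-two bytes: {}", smallest_pow2_bytes(bits));
    println!("size_of::<char>(): {}", size_of::<char>());
}
```

Three bytes would be enough to hold 21 bits, but three is not a power of two, which is why the result rounds up to four.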
Note: the emphasis on Scalar value is because a number of "characters" as we see them are actually graphemes composed of multiple combining characters in Unicode; in that case, multiple char values are required.
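As a small illustration of that note, here is a sketch comparing two ways of writing "é": as a single precomposed scalar value (U+00E9) and as "e" followed by a combining acute accent (U+0301). The helper char_count is a hypothetical name used only for this example.

```rust
/// Hypothetical helper: number of `char` values (Unicode scalar values)
/// in a string slice.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let precomposed = "\u{00E9}"; // é as one scalar value
    let combined = "e\u{0301}";   // e + combining acute accent

    // Both typically render as the same visible character,
    // but they contain a different number of chars.
    println!("{} -> {} char(s)", precomposed, char_count(precomposed));
    println!("{} -> {} char(s)", combined, char_count(combined));
}
```

Counting user-perceived characters (grapheme clusters) is not covered by char itself; the standard library only iterates over scalar values here.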
char is four bytes. It is always four bytes, it will always be four bytes. Four bytes it be, and four bytes shall it remain.
It's not for anything special; four bytes is simply the smallest power of two in which you can store any Unicode scalar value. Various other languages do the same thing.
char is four bytes; it doesn't depend on the architecture.
Why? According to Wikipedia's UTF-8 article:
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use. Four bytes are needed for characters in the other planes of Unicode.
So if you want to represent any possible Unicode character in a fixed-size value, the compiler must reserve 4 bytes.
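The byte counts from the quoted passage can be observed with char::len_utf8. This is a minimal sketch; the wrapper utf8_len is a hypothetical name, and the sample characters are chosen here only to hit each length class. Note that these are the lengths of the UTF-8 encoding (as used inside str and String), while a standalone char value always occupies 4 bytes.

```rust
use std::mem::size_of_val;

/// Hypothetical wrapper: number of bytes `c` needs when encoded as UTF-8.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // One sample per UTF-8 length class: ASCII, two-byte,
    // rest of the Basic Multilingual Plane, and another plane.
    for c in ['A', '\u{00E9}', '\u{20AC}', '\u{1F4A9}'] {
        println!(
            "U+{:04X}: {} byte(s) as UTF-8, {} bytes as char",
            c as u32,
            utf8_len(c),
            size_of_val(&c)
        );
    }
}
```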
You should also consider Byte Alignment: http://www.eventhelix.com/realtimemantra/ByteAlignmentAndOrdering.htm