 

Isn't the size of a character in Java 2 bytes?

Tags:

java

string

char

I used RandomAccessFile to read a byte from a text file.

public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    fr.read(cbuff, 0, 1);
    System.out.println(new String(cbuff));
}

Why does this print one full character when I only read one byte?

Asked Feb 22 '11 by Shrinath


People also ask

Is char 2 bytes in Java?

Java was designed around UTF-16 (Unicode Transformation Format), which was being defined at the time. The char data type in Java represents a 16-bit UTF-16 code unit, so the size of char in Java is 2 bytes, whereas a char in the C language is 1 byte.
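
A quick sketch to confirm the Java side of this (Character.BYTES requires Java 8 or later):

System.out.println(Character.SIZE);   // 16 bits per char
System.out.println(Character.BYTES);  // 2 bytes per char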

Is a character 2 bytes?

In Java, a character is encoded in UTF-16, which uses 2 bytes per code unit, while a normal C string is more or less just a bunch of bytes.

Is char 1 or 2 bytes?

In C and C++, the char type takes 1 byte of memory (8 bits) and can express 2^8 = 256 values; a signed char holds values from -128 to 127. In Java, by contrast, char is 2 bytes.

Why does char take 2 bytes in Java?

Every Java char is made up of 2 bytes because Java internally uses UTF-16. For instance, if a String contains a word in the English language, the leading 8 bits of every char will be 0, since an ASCII character can be represented in a single byte.
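
A rough sketch of what that looks like in practice, encoding an ASCII-only word with an explicit UTF-16 charset:

import java.nio.charset.StandardCharsets;

byte[] bytes = "Java".getBytes(StandardCharsets.UTF_16BE);
System.out.println(bytes.length);  // 8: two bytes per character
// The bytes are 0,'J', 0,'a', 0,'v', 0,'a' -- the leading byte of each pair is 0 for ASCII text.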


2 Answers

A char represents a character in Java (*). It is 2 bytes (16 bits) in size.

That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact, many character encodings reserve only 1 byte for every character (or use 1 byte for the most common characters).

When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset. Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.

If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
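
To see the difference, here is a small sketch that decodes the same single byte with two explicit charsets (the byte value 0x41 is just an example):

import java.nio.charset.StandardCharsets;

byte[] one = { 0x41 };  // 'A' in ISO-8859-1 (and ASCII)
System.out.println(new String(one, StandardCharsets.ISO_8859_1));  // prints "A"
System.out.println(new String(one, StandardCharsets.UTF_16BE));    // prints the replacement character U+FFFD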

That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
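
For example, here is a minimal sketch of reading characters with an explicit charset (the file name "test.txt" and the choice of UTF-8 are assumptions, not part of the original question):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    public static void main(String[] args) throws IOException {
        // Decode bytes to characters with an explicitly named charset,
        // so the result does not depend on the platform default.
        try (Reader reader = new InputStreamReader(
                new FileInputStream("test.txt"), StandardCharsets.UTF_8)) {
            int c = reader.read();  // one character, however many bytes it took
            System.out.println((char) c);
        }
    }
}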

(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
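
For the curious, a short sketch of the code unit vs. code point distinction, using a character outside the Basic Multilingual Plane:

String s = "\uD83D\uDE00";  // U+1F600 (an emoji), written as a UTF-16 surrogate pair
System.out.println(s.length());                       // 2 chars (code units)
System.out.println(s.codePointCount(0, s.length()));  // 1 code point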

Answered Sep 19 '22 by Joachim Sauer


Java stores all its chars internally as two bytes. However, when characters are encoded into bytes (for example, when a String is written to a file or stream), the number of bytes depends on your encoding.

Some characters (the ASCII range) are a single byte in common encodings, but many others are multi-byte.
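
For example, a small sketch using UTF-8 (getBytes with an explicit charset):

import java.nio.charset.StandardCharsets;

System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 byte
System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes
System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3 bytes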

Java supports Unicode; thus, according to:

Java Character Docs

the maximum value a char can hold is '\uFFFF' (hex FFFF, decimal 65535), or 11111111 11111111 in binary (two bytes).
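
That limit is easy to check (a minimal sketch):

System.out.println((int) Character.MAX_VALUE);                    // 65535
System.out.println(Integer.toBinaryString(Character.MAX_VALUE));  // 1111111111111111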

Answered Sep 21 '22 by Michael