I used RandomAccessFile
to read a byte
from a text file.
public static void readFile(RandomAccessFile fr) { byte[] cbuff = new byte[1]; fr.read(cbuff,0,1); System.out.println(new String(cbuff)); }
Why am I seeing one full character being read by this?
Java was designed for using Unicode Transformed Format (UTF)-16, when the UTF-16 was designed. The 'char' data type in Java originally used for representing 16-bit Unicode. Therefore the size of the char data type in Java is 2 byte, and same for the C language is 1 byte.
In Java, a character is encoded in UTF-16 which uses 2 bytes, while a normal C string is more or less just a bunch of bytes.
The char type takes 1 byte of memory (8 bits) and allows expressing in the binary notation 2^8=256 values. The char type can contain both positive and negative values. The range of values is from -128 to 127.
And, every char is made up of 2 bytes because Java internally uses UTF-16. For instance, if a String contains a word in the English language, the leading 8 bits will all be 0 for every char, as an ASCII character can be represented using a single byte.
A char
represents a character in Java (*). It is 2 bytes large (or 16 bits).
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).
When you call the String(byte[])
constructor you ask Java to convert the byte[]
to a String
using the platform's default charset. Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String
containing the Unicode Replacement Character instead).
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[]
and char[]
/String
or between InputStream
and Reader
or between OutputStream
and Writer
, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
(*) that's not entirely true: a char
represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
Java stores all it's "chars" internally as two bytes. However, when they become strings etc, the number of bytes will depend on your encoding.
Some characters (ASCII) are single byte, but many others are multi-byte.
Java supports Unicode, thus according to:
Java Character Docs
The max value supported is "\uFFFF" (hex FFFF, dec 65535), or 11111111 11111111 binary (two bytes).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With