A Java string containing special characters such as ç uses two bytes for each such character, but neither the String length method nor the length of the byte array returned by the getBytes method counts those characters as two bytes.
How can I correctly count the number of bytes in a String?
Example:
The word endereço should give me a length of 9 instead of 8.
The word endereço should return me length 9 instead of 8.
If you expect a size of 9 bytes for the "endereço" String, which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I suppose that you want to use the UTF-8 charset, which uses 1 byte for characters included in the ASCII table and more for the others.
but String length method or getting the length of it with the byte array returned from getBytes method doesn't return special chars counted as two bytes.
The String length() method doesn't answer the question "how many bytes are used?". It answers "how many UTF-16 code units (or, more simply, chars) does the String contain?"
String length() Javadoc:
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
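To make the difference between code units, code points and encoded bytes concrete, here is a minimal sketch. The class name LengthVsBytes and the helper utf8Bytes are made up for illustration, and the strings are written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares length(), code points and UTF-8 byte count.
final class LengthVsBytes {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o";   // "endereço"
        String emoji = "\uD83D\uDE00";   // U+1F600, a single code point outside the BMP

        System.out.println(word.length());                           // 8 UTF-16 code units
        System.out.println(word.codePointCount(0, word.length()));   // 8 code points
        System.out.println(utf8Bytes(word));                         // 9 bytes in UTF-8

        System.out.println(emoji.length());                          // 2 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
        System.out.println(utf8Bytes(emoji));                        // 4 bytes in UTF-8
    }
}
```

None of the three numbers is "the" length of the String; which one you need depends on whether you are counting chars, characters, or bytes in a given encoding.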
The byte[] getBytes() method with no argument encodes the String into a byte array. You can use the length property of the returned array to know how many bytes the encoded String uses, but the result depends on the charset used during the encoding.
But the byte[] getBytes() method doesn't let you specify the charset: it uses the platform's default charset.
So it may not give the expected result if the default charset of the underlying OS is not the one you want to use to encode your Strings into bytes.
Besides, the way the Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.
byte[] getBytes() Javadoc:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
In your String example "endereço", if getBytes() returns an array with a size of 8 and not 9, it means that your OS doesn't use UTF-8 by default but a charset with a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derived charsets, such as windows-1252 on Windows.
To know the default charset of the Java virtual machine where the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
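The following sketch shows how the byte count of the same String changes with the charset. The class name CharsetSizes and the helper encodedSize are hypothetical, and ISO_8859_1 is used as a stand-in for a single-byte platform default such as windows-1252. Note that the default charset printed on the first line depends on your JVM and OS (on Java 18 and later it is UTF-8 unless configured otherwise).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: same String, different charsets, different sizes.
final class CharsetSizes {

    /** Number of bytes produced when encoding s with the given charset. */
    static int encodedSize(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"

        // Platform dependent: this is the charset getBytes() without argument uses.
        System.out.println("default charset = " + Charset.defaultCharset());

        System.out.println(encodedSize(word, StandardCharsets.ISO_8859_1)); // 8: one byte per character
        System.out.println(encodedSize(word, StandardCharsets.UTF_8));      // 9: ç takes two bytes
        System.out.println(encodedSize(word, StandardCharsets.UTF_16BE));   // 16: two bytes per char, no BOM
    }
}
```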
Solution
The byte[] getBytes() method comes with two other very useful overloads:
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
byte[] java.lang.String.getBytes(Charset charset)
Unlike the getBytes() method with no argument, these methods let you specify the charset to use during the byte encoding.
byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
Javadoc :
Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.
The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
byte[] java.lang.String.getBytes(Charset charset)
Javadoc :
Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
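As a sketch of what "more control" can look like, the hypothetical helper below (strictByteLength, in a made-up class StrictEncoding) uses a CharsetEncoder configured to report unmappable characters instead of silently replacing them. It is one possible approach, not the only one.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: byte length that fails loudly on unencodable input.
final class StrictEncoding {

    /** Encoded size in bytes; throws if s cannot be represented in the charset. */
    static int strictByteLength(String s, Charset charset) throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // encode() returns a buffer positioned at 0 whose limit is the byte count.
        return encoder.encode(CharBuffer.wrap(s)).remaining();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String word = "endere\u00e7o"; // "endereço"

        System.out.println(strictByteLength(word, StandardCharsets.UTF_8)); // 9

        // getBytes(Charset) replaces ç with '?', so the size looks plausible but the data is altered.
        System.out.println(word.getBytes(StandardCharsets.US_ASCII).length); // 8

        try {
            strictByteLength(word, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("ç cannot be encoded in US-ASCII");
        }
    }
}
```

The design choice here is to prefer an exception over a silent replacement, which is useful when the byte count is used to validate data against a storage limit.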
You can use either one (although there are some subtle differences between them) to encode your String into a byte array with UTF-8 or any other charset, and so get its size for that specific charset.
For example, to get a UTF-8 encoded byte array by using getBytes(String charsetName), you can do this:
String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;
And you will get a length of 9 bytes as you wish.
Here is a more comprehensive example that displays the default charset and the encoded size with the platform's default charset, UTF-8 and UTF-16:
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringByteSize {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);
        // String sample
        String yourString = "endereço";
        // getBytes() with default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());
        // getBytes() with specific charset UTF-8
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());
        // getBytes() with specific charset UTF-16 (the result includes a 2-byte byte order mark)
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}
Output on my machine, which runs Windows:
default charset = windows-1252
getBytes() with default charset, size = 8
getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9
getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
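If you only need the UTF-8 byte count and want to avoid allocating a byte array, you can also compute it from the code points. This is a sketch under assumptions: the class name Utf8Length and the method utf8Length are hypothetical, and an unpaired surrogate is counted as 1 byte on the assumption that getBytes(StandardCharsets.UTF_8) replaces it with '?'.

```java
// Hypothetical helper: UTF-8 size computed without building the byte array.
final class Utf8Length {

    /** Number of bytes s would occupy when encoded as UTF-8. */
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                bytes += 1;            // ASCII
            } else if (cp < 0x800) {
                bytes += 2;            // e.g. ç (U+00E7)
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                bytes += 1;            // unpaired surrogate, assumed to be replaced by '?'
            } else if (cp < 0x10000) {
                bytes += 3;            // rest of the Basic Multilingual Plane
            } else {
                bytes += 4;            // supplementary code points
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("endere\u00e7o")); // 9
    }
}
```

For most applications the getBytes(StandardCharsets.UTF_8).length approach shown above is simpler and sufficient; a helper like this only matters for very long strings or hot paths.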