How to count String bytes properly?

Tags: java, string, encoding, utf-8

A Java string containing special characters such as ç takes two bytes for each such character, but neither the String length() method nor the length of the byte array returned by getBytes() counts special characters as two bytes.

How can I count correctly the number of bytes in a String?

Example:

The word endereço should return me length 9 instead of 8.

asked Apr 03 '17 22:04

Philippe Gioseffi

1 Answer

The word endereço should return me length 9 instead of 8.

If you expect a size of 9 bytes for the String "endereço", which has a length of 8 characters (7 ASCII characters and 1 non-ASCII character), I assume that you want to use the UTF-8 charset, which uses 1 byte for each character in the ASCII table and 2 to 4 bytes for every other character.

but neither the String length() method nor the length of the byte array returned by getBytes() counts special characters as two bytes.

The String length() method does not answer the question "how many bytes are used?". It answers "how many UTF-16 code units (or, more simply, chars) does the string contain?"

String length() Javadoc:

Returns the length of this string. The length is equal to the number of Unicode code units in the string.
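To make the difference between code units, code points and bytes concrete, here is a minimal sketch. The class name LengthVsBytes and the emoji sample are my own additions for illustration, not part of the question.

```java
import java.nio.charset.StandardCharsets;

class LengthVsBytes {
    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of the compiler's encoding.
        String word = "endere\u00e7o";   // "endereço"
        String emoji = "\uD83D\uDE00";   // U+1F600, one code point stored as two chars

        System.out.println(word.length());                                  // 8 UTF-16 code units
        System.out.println(word.getBytes(StandardCharsets.UTF_8).length);   // 9 bytes in UTF-8
        System.out.println(emoji.length());                                 // 2 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));        // 1 code point
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);  // 4 bytes in UTF-8
    }
}
```

So length() matches neither the number of user-visible characters nor the number of bytes in general; it only counts chars.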

The no-argument byte[] getBytes() method encodes the String into a byte array. You can use the length of the returned array to find out how many bytes the encoded String takes, but the result depends on the charset used for the encoding, and the no-argument getBytes() does not let you specify one: it uses the platform's default charset.
So it may not give the expected result if the default charset is not the one you want to use to encode your Strings into bytes.
Besides, the way Strings are encoded into bytes may change depending on the platform where the application is deployed, which may be undesirable.
Finally, if the String cannot be encoded in the default charset, the behavior is unspecified.
So this method should be used with great caution, or not at all.

byte[] getBytes() Javadoc:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
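As a sketch of what "more control" can look like, the following uses CharsetEncoder so that an unencodable character raises an exception instead of being handled silently. The class StrictByteCount and the helper strictByteLength are hypothetical names I made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictByteCount {
    // Hypothetical helper: returns the encoded size, or throws if the text
    // cannot be represented in the given charset.
    static int strictByteLength(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer encoded = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        return encoded.remaining(); // bytes between position 0 and the limit
    }

    public static void main(String[] args) throws CharacterCodingException {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(strictByteLength(word, StandardCharsets.UTF_8));      // 9
        System.out.println(strictByteLength(word, StandardCharsets.ISO_8859_1)); // 8
        try {
            strictByteLength(word, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("not encodable in US-ASCII: " + e);
        }
    }
}
```

This is more verbose than getBytes(), so it is probably only worth it when a silent substitution would be a bug in your application.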

In your example String "endereço", if getBytes() returns an array of size 8 and not 9, it means that your platform's default charset is not UTF-8 but a charset with a fixed width of 1 byte per character, such as ISO 8859-1 or one of its derivatives such as windows-1252 on Windows.

To find out the default charset of the Java virtual machine in which the application runs, you can use this utility method: Charset defaultCharset = Charset.defaultCharset().
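To reproduce the "8 instead of 9" result on any machine, you can name the single-byte charset explicitly instead of relying on the default. This is a minimal sketch; the class name DefaultCharsetCheck is my own. Note that, as far as I know, JDK 18 and later use UTF-8 as the default charset unless it is overridden, so the "default" line may well print 9 on a recent JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println("default charset = " + Charset.defaultCharset());
        // Machine dependent: this is the value that surprised the asker.
        System.out.println("default    : " + word.getBytes().length);
        // Machine independent: ç is a single byte in ISO 8859-1, two bytes in UTF-8.
        System.out.println("ISO-8859-1 : " + word.getBytes(StandardCharsets.ISO_8859_1).length); // 8
        System.out.println("UTF-8      : " + word.getBytes(StandardCharsets.UTF_8).length);      // 9
    }
}
```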

Solution

The byte[] getBytes() method has two other very useful overloads:

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException
byte[] java.lang.String.getBytes(Charset charset)

Unlike the no-argument getBytes() method, these overloads let you specify the charset to use for the encoding.

byte[] java.lang.String.getBytes(String charsetName) throws UnsupportedEncodingException Javadoc:

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

The behavior of this method when this string cannot be encoded in the given charset is unspecified. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.

byte[] java.lang.String.getBytes(Charset charset) Javadoc:

Encodes this String into a sequence of bytes using the given charset, storing the result into a new byte array.

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array. The java.nio.charset.CharsetEncoder class should be used when more control over the encoding process is required.
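The replacement behavior described in that Javadoc can be observed with a charset that cannot represent ç. In this minimal sketch (the class name ReplacementDemo is mine), US-ASCII substitutes its replacement byte, a question mark, so the byte count says nothing about data loss.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        byte[] ascii = word.getBytes(StandardCharsets.US_ASCII);
        System.out.println(ascii.length);                                  // 8
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));  // endere?o
    }
}
```

In other words, a plausible-looking length does not prove that the String survived the encoding intact.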

You may use either one (although they differ subtly in how they handle errors, as the Javadoc above shows) to encode your String into a byte array with UTF-8 or any other charset, and so get its size in that specific charset.

For example, to get a UTF-8 encoded byte array using getBytes(String charsetName), you can do this:

String yourString = "endereço";
byte[] bytes = yourString.getBytes("UTF-8");
int sizeInBytes = bytes.length;

And you will get a length of 9 bytes as you wish.
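If you do this often, the count can be wrapped in a small helper. The sketch below is my own illustration, and the names Utf8Length, utf8Length and utf8LengthNoAlloc are hypothetical. The first method uses StandardCharsets.UTF_8, which avoids the checked UnsupportedEncodingException. The second counts the bytes from the code points without allocating an array, assuming the String is well-formed UTF-16.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Simple variant: encode, then measure the array.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Allocation-free variant: add up the UTF-8 width of each code point.
    // Assumes well-formed input; for an unpaired surrogate it counts 3 bytes,
    // whereas getBytes() would substitute a 1-byte replacement.
    static int utf8LengthNoAlloc(CharSequence s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return bytes;
    }

    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        System.out.println(utf8Length(word));        // 9
        System.out.println(utf8LengthNoAlloc(word)); // 9
    }
}
```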

Here is a more comprehensive example that displays the default charset, then the encoded size with the platform's default charset, with UTF-8 and with UTF-16:

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StringBytesDemo {

    public static void main(String[] args) throws UnsupportedEncodingException {

        // default charset of the running JVM
        Charset defaultCharset = Charset.defaultCharset();
        System.out.println("default charset = " + defaultCharset);

        // String sample
        String yourString = "endereço";

        // getBytes() with the default platform encoding
        System.out.println("getBytes() with default charset, size = " + yourString.getBytes().length + System.lineSeparator());

        // getBytes() with specific charset UTF-8 (by name, then by constant)
        System.out.println("getBytes(\"UTF-8\"), size = " + yourString.getBytes("UTF-8").length);
        System.out.println("getBytes(StandardCharsets.UTF_8), size = " + yourString.getBytes(StandardCharsets.UTF_8).length + System.lineSeparator());

        // getBytes() with specific charset UTF-16 (by name, then by constant)
        System.out.println("getBytes(\"UTF-16\"), size = " + yourString.getBytes("UTF-16").length);
        System.out.println("getBytes(StandardCharsets.UTF_16), size = " + yourString.getBytes(StandardCharsets.UTF_16).length);
    }
}

Output on my Windows machine:

default charset = windows-1252
getBytes() with default charset, size = 8

getBytes("UTF-8"), size = 9
getBytes(StandardCharsets.UTF_8), size = 9

getBytes("UTF-16"), size = 18
getBytes(StandardCharsets.UTF_16), size = 18
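The UTF-16 size of 18 rather than 16 may look surprising: Java's UTF-16 charset writes a 2-byte byte order mark before the 8 two-byte code units. The sketch below (the class name Utf16Bom is mine) compares it with the UTF-16BE and UTF-16LE charsets, which fix the byte order and write no mark.

```java
import java.nio.charset.StandardCharsets;

class Utf16Bom {
    public static void main(String[] args) {
        String word = "endere\u00e7o"; // "endereço"
        byte[] withBom = word.getBytes(StandardCharsets.UTF_16);

        System.out.println(withBom.length);                                  // 18 = 2 (BOM) + 8 * 2
        System.out.printf("%02X %02X%n", withBom[0], withBom[1]);            // FE FF
        System.out.println(word.getBytes(StandardCharsets.UTF_16BE).length); // 16
        System.out.println(word.getBytes(StandardCharsets.UTF_16LE).length); // 16
    }
}
```

So if you need the size of the UTF-16 payload only, UTF_16BE or UTF_16LE is probably what you want to measure.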

answered Oct 21 '22 14:10

davidxxx
