I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating it in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array sized as the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).
So - is there some way to calculate the byte size of a String/encoding pair directly from the String object?
UPDATE:
Here's a working implementation of jtahlborn's answer:
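private class CountingOutputStream extends OutputStream {
    // long rather than int: the whole point is to count sizes that may exceed 2^31 - 1
    long total;

    @Override
    public void write(int i) {
        // single-byte writes are not expected from OutputStreamWriter
        throw new RuntimeException("don't use");
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        total += len;
    }
}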
So a string size is 18 + (2 * number of characters) bytes. (In reality, another 2 bytes are sometimes used for padding to ensure 32-bit alignment, but I'll ignore that.) 2 bytes are needed for each character, since .NET strings are UTF-16.
Byte objects are sequences of bytes, whereas strings are sequences of characters. Byte objects are already in machine-readable form, so they can be stored directly on disk; a string has to be encoded into bytes first.
If you want the size of a Python string object in bytes, you can use the getsizeof() function from the sys module. Note that it reports the object's in-memory footprint, including object overhead, not the length of any particular encoding.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes.
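For the UTF-8 case specifically, the 1-to-4-byte rule means the encoded size can be computed straight from the String's UTF-16 code units, with no byte[] allocated at all. The following is a minimal sketch, not taken from the thread: the class name Utf8Length and the method of() are made up for illustration, and counting an unpaired surrogate as 1 byte is an assumption chosen to match getBytes(), which substitutes '?' for malformed input.

```java
/** Hypothetical helper: counts UTF-8 bytes without encoding anything. */
final class Utf8Length {
    private Utf8Length() {}

    static long of(CharSequence s) {
        long bytes = 0;                       // long, so sizes above 2^31 - 1 are representable
        final int n = s.length();
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                   // ASCII range
            } else if (c < 0x800) {
                bytes += 2;                   // U+0080 .. U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                   // valid surrogate pair = one supplementary code point
                i++;                          // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                   // assumption: unpaired surrogate is replaced by '?'
            } else {
                bytes += 3;                   // remaining BMP characters
            }
        }
        return bytes;
    }
}
```

For example, Utf8Length.of("h\u00e9llo") returns 6: four ASCII characters at 1 byte each plus 2 bytes for é. This only covers UTF-8; other encodings need the encoder itself to do the counting.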
Simple, just write it to a dummy output stream:
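import java.io.OutputStream;

// Discards everything written to it and only counts the bytes.
class CountingOutputStream extends OutputStream {
    // long, not int: the total may exceed Integer.MAX_VALUE for very large strings
    private long _total;

    @Override public void write(int b) {
        ++_total;
    }
    @Override public void write(byte[] b) {
        _total += b.length;
    }
    @Override public void write(byte[] b, int offset, int len) {
        _total += len;
    }
    public long getTotalSize() {
        return _total;
    }
}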
CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);
// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string; to avoid that, write in chunks:
for (int i = 0; i < myString.length(); i += 8096) {
    int end = Math.min(myString.length(), i + 8096);
    writer.write(myString, i, end - i);
}
writer.flush();
System.out.println("Total bytes: " + cos.getTotalSize());
It's not only simple, but probably just as fast as the other "complex" answers.
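The same counting idea can also be sketched without a Writer, by driving a CharsetEncoder directly into a small reusable ByteBuffer. This is an alternative, not the approach from the answer above: the class name EncodedSize, the method of() and the 8192-byte buffer size are arbitrary choices made for illustration. Malformed or unmappable input is replaced rather than reported, which is assumed to mirror what getBytes() does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: counts encoded bytes using a fixed-size scratch buffer. */
final class EncodedSize {
    private EncodedSize() {}

    static long of(CharSequence s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);          // a view over s, not a copy
        ByteBuffer out = ByteBuffer.allocate(8192);  // scratch space, reused for every chunk
        long total = 0;
        CoderResult result;
        // Encode until the input is exhausted; an overflow just means "buffer full".
        do {
            result = encoder.encode(in, out, true);
            total += out.position();                 // bytes produced in this round
            out.clear();                             // discard them and reuse the buffer
        } while (result.isOverflow());
        // Some encoders hold back trailing bytes until flush() is called.
        do {
            result = encoder.flush(out);
            total += out.position();
            out.clear();
        } while (result.isOverflow());
        return total;
    }
}
```

A call would look like EncodedSize.of(myString, StandardCharsets.UTF_16). Memory use stays at the size of the scratch buffer regardless of how long the string is, and the count includes whatever the encoder emits, such as the byte order mark that Java's UTF-16 encoder writes.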