I'm afraid I have a question about a detail of a rather oversaturated topic. I searched around a lot but couldn't find a clear answer to this specific and (IMHO) important problem:
When converting byte[] to String using UTF-8, each byte (8 bits) becomes an 8-bit character encoded by UTF-8, but each UTF-8 character is stored as a 16-bit char in Java. Is that correct? If yes, this means that each Java character only uses the first 8 bits and consumes double the memory. Is that correct too? I wonder how this wasteful behaviour is acceptable..
Isn't there some trick to have a pseudo-String that is 8 bits wide? Would that actually result in less memory consumption? Or is there a way to store >two< 8-bit characters in one 16-bit Java char to avoid this memory waste?
thanks for any deconfusing answers...
EDIT: Hi, thanks everybody for answering. I was aware of the variable-length property of UTF-8. However, since my source is byte[], which is 8 bits per element, I understood (apparently wrongly) that it only needs 8-bit UTF-8 code units. Does UTF-8 conversion actually produce the strange symbols that you see when you run "cat somebinary" on the CLI? I thought UTF-8 was just somehow used to map each of the 256 possible 8-bit values of a byte to one particular 8-bit UTF-8 value. Wrong? I thought about using Base64, but it's bad because its output characters are 7-bit ASCII and each carries only 6 bits of payload..
Question reformulated: is there a smarter way to convert byte[] to some kind of String? My favorite was to just cast byte[] to char[], but then I still have 16-bit words.
additional use case info:
I'm adapting Jedis (a Java client for the NoSQL store Redis) as the "primitive storage layer" for HyperGraphDB. So Jedis is a database for another "database". My problem is that I have to feed Jedis with byte[] data all the time, but internally, >Redis< (the actual server) deals only with "binary safe" strings. Since Redis is written in C, a char there is 8 bits long; AFAIK that's not ASCII, which is 7 bits. In Jedis, however, in the Java world, every character is 16 bits long internally. I don't understand this code (yet), but I suppose Jedis then converts these 16-bit Java strings to Redis-conforming 8-bit strings ([here][3]). It says it extends FilterOutputStream. My hope is to bypass the byte[] <-> String conversion altogether and use that FilterOutputStream...?
Now I wonder: if I have to interconvert byte[] and String all the time, with data sizes ranging from very small to potentially very big, isn't it a huge waste of memory to have each 8-bit character passed around as 16 bits within Java?
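On the binary-safety part of the question, a minimal sketch (using only the standard `java.nio.charset` API; class name is my own) showing why a UTF-8 round trip can't preserve arbitrary bytes, while ISO-8859-1 can, since it maps every byte value 0x00-0xFF to the code point with the same value:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteStringRoundTrip {
    public static void main(String[] args) {
        byte[] data = { (byte) 0x00, (byte) 0x41, (byte) 0x80, (byte) 0xFF };

        // UTF-8 is NOT binary safe: bytes >= 0x80 that don't form a valid
        // multibyte sequence are decoded to U+FFFD, so the round trip is lossy.
        String utf8 = new String(data, StandardCharsets.UTF_8);
        byte[] utf8Back = utf8.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip lossless? "
                + Arrays.equals(data, utf8Back));      // false

        // ISO-8859-1 maps each byte one-to-one to a code point, so the round
        // trip is lossless: one char per byte (though each char still
        // occupies 16 bits inside the String).
        String latin1 = new String(data, StandardCharsets.ISO_8859_1);
        byte[] latin1Back = latin1.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 round trip lossless? "
                + Arrays.equals(data, latin1Back));    // true
    }
}
```

This is the usual trick for smuggling binary data through APIs that insist on String, but it doesn't reduce memory use: each byte still becomes a 16-bit char.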
Isn't there some trick to have a pseudo String that is 8 bit?
Yes, make sure you have an up-to-date version of Java. ;)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
-XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)
EDIT: This option doesn't work in Java 6 update 22 and is not on by default in Java 6 update 24. Note: it appears this option may slow performance by about 10%.
The following program
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public static void main(String... args) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
        sb.append(i);
    for (int j = 0; j < 10; j++)
        test(sb, j >= 2); // skip the first two runs as warm-up
}

private static void test(StringBuilder sb, boolean print) {
    List<String> strings = new ArrayList<String>();
    forceGC();
    long free = Runtime.getRuntime().freeMemory();
    long size = 0;
    for (int i = 0; i < 100; i++) {
        final String s = "" + sb + i;
        strings.add(s);
        size += s.length();
    }
    forceGC();
    long used = free - Runtime.getRuntime().freeMemory();
    if (print)
        System.out.println("Bytes per character is " + (double) used / size);
}

private static void forceGC() {
    try {
        // Two GC passes with pauses to get a more stable freeMemory() reading.
        System.gc();
        Thread.sleep(250);
        System.gc();
        Thread.sleep(250);
    } catch (InterruptedException e) {
        throw new AssertionError(e);
    }
}
prints this by default:
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013606946433575
Bytes per character is 2.0013668655941212
and with the option -XX:+UseCompressedStrings:
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014609725932648
Bytes per character is 1.0014671435440285
Actually, you have the UTF-8 part wrong: UTF-8 is a variable-length encoding, so valid characters are 1-4 bytes in length (in other words, some UTF-8 characters are 8 bits, some 16 bits, some 24 bits, and some 32 bits). Although the 1-byte characters take up only 8 bits each, there are many more multibyte characters. If you only had 1-byte characters, you could have at most 256 different characters in total (a.k.a. "Extended ASCII"); that may be sufficient for 90% of English use (my naïve guesstimate), but would bite you in the ass as soon as you even think of anything beyond that subset (see the word naïve: English, yet it can't be written with just ASCII).
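To make the variable-length point concrete, a small sketch using only the standard `String.getBytes` API (class name is my own); the encoded size depends on the code point, not on a fixed width:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A: 1 byte in UTF-8
        System.out.println("A -> " + "A".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        // U+00EF LATIN SMALL LETTER I WITH DIAERESIS: 2 bytes
        System.out.println("\u00EF -> " + "\u00EF".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        // U+20AC EURO SIGN: 3 bytes
        System.out.println("\u20AC -> " + "\u20AC".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        // U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP): 4 bytes
        System.out.println("U+1D11E -> " + "\uD834\uDD1E".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
    }
}
```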
So, although UTF-16 (which Java uses internally) looks wasteful, it's actually not. Anyway, unless you're on a very limited embedded system (in which case, what are you doing there with Java?), trying to trim down the strings is pointless micro-optimization.
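Note that UTF-16 is not a fixed 16-bit encoding either: code points outside the Basic Multilingual Plane are stored as a surrogate pair of two 16-bit char units. A small sketch with standard `String` methods (class name is my own):

```java
public class Utf16Units {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is above U+FFFF, so Java stores it
        // as a surrogate pair: two 16-bit char units for one code point.
        String clef = "\uD834\uDD1E";
        System.out.println("char units:  " + clef.length());                         // 2
        System.out.println("code points: " + clef.codePointCount(0, clef.length())); // 1
    }
}
```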
For a slightly longer introduction to character encodings, see e.g. this: http://www.joelonsoftware.com/articles/Unicode.html