
How to avoid memory wastage when storing UTF-8 characters (8 bit) in Java chars (16 bit)? Two in one?

I'm afraid I have a question on a detail of a rather oversaturated topic. I searched around a lot, but couldn't find a clear answer to this specific, obvious and (imho) important problem:

When converting byte[] to String using UTF-8, each byte (8 bit) becomes an 8-bit character encoded in UTF-8, but each UTF-8 character is stored as a 16-bit char in Java. Is that correct? If yes, that means each stupid Java char only uses its first 8 bits and consumes double the memory? Is that correct too? I wonder how this wasteful behaviour is acceptable..

Isn't there some trick to have a pseudo-String that is 8 bit? Would that actually result in less memory consumption? Or maybe, is there a way to store *two* 8-bit characters in one 16-bit Java char to avoid this memory waste?
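Just so it's clear what I mean by "two in one", here is roughly the packing I imagine (my own sketch, not any real API; note that the resulting chars can be unpaired surrogates, i.e. not a valid String, and you have to remember the original byte count separately):

public class TwoInOne {
    // pack two 8-bit bytes into each 16-bit char ("two in one")
    static char[] pack(byte[] bytes) {
        char[] packed = new char[(bytes.length + 1) / 2];
        for (int i = 0; i < bytes.length; i++) {
            int shift = (i % 2 == 0) ? 8 : 0; // even index -> high byte, odd -> low byte
            packed[i / 2] |= (bytes[i] & 0xFF) << shift;
        }
        return packed;
    }

    // the original byte count must be passed in, since an odd count
    // still fills a whole (half-empty) char
    static byte[] unpack(char[] packed, int byteLength) {
        byte[] bytes = new byte[byteLength];
        for (int i = 0; i < byteLength; i++) {
            int shift = (i % 2 == 0) ? 8 : 0;
            bytes[i] = (byte) (packed[i / 2] >> shift);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] in = { 1, 2, 3 };
        byte[] out = unpack(pack(in), in.length);
        System.out.println(java.util.Arrays.equals(in, out)); // true
    }
}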

thanks for any deconfusing answers...

EDIT: hi, thanks everybody for answering. I was aware of the variable-length property of UTF-8. However, since my source is byte, which is 8 bit, I understood (apparently wrongly) that it would only need 8-bit UTF-8 words. Is the UTF-8 conversion actually producing the strange symbols that you see when you do "cat somebinary" on the CLI? I thought UTF-8 was just somehow used to map each of the possible 8-bit words of a byte to one particular 8-bit word of UTF-8. Wrong? I thought about using Base64, but it's bad because it uses only 7 bits..

Question reformulated: is there a smarter way to convert byte[] to something String-like? My favorite was to just cast byte[] to char[], but then I'd still have 16-bit words.
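Here is a little check I put together (just a sketch; the class name is made up). It shows that UTF-8 does not round-trip arbitrary bytes, while ISO-8859-1, which maps each byte value 1:1 to the char with the same value, does:

import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTripCheck {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] data = { 0x41, (byte) 0x80, (byte) 0xFF }; // 'A' plus two bytes that form no valid UTF-8 sequence

        // UTF-8 is not a 1:1 byte<->char mapping: the invalid bytes are
        // replaced with U+FFFD on decoding, so the round trip loses data.
        String viaUtf8 = new String(data, utf8);
        System.out.println(Arrays.equals(data, viaUtf8.getBytes(utf8))); // false

        // ISO-8859-1 maps byte n to char n, so the round trip is lossless
        // (each char still occupies 16 bits in memory, though).
        String viaLatin1 = new String(data, latin1);
        System.out.println(Arrays.equals(data, viaLatin1.getBytes(latin1))); // true
    }
}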

additional use case info:

I'm adapting Jedis (a Java client for the NoSQL store Redis) as the "primitive storage layer" for HyperGraphDB. So Jedis is a database for another "database". My problem is that I have to feed Jedis with byte[] data all the time, but internally Redis (the actual server) deals only with "binary safe" strings. Since Redis is written in C, a char is 8 bits long, AFAIK not ASCII which is 7 bits. In Jedis, however (Java world), every character is internally 16 bits long. I don't understand this code (yet), but I suppose Jedis then converts these 16-bit Java strings to Redis-conforming 8-bit strings ([here][3]; it says it extends FilterOutputStream). My hope is to bypass the byte[] <-> String conversion altogether and use that FilterOutputStream...?
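What I would like, ideally, is to skip strings entirely and talk to Redis in byte[] directly. Something like this sketch; I'm assuming here that the Jedis version at hand exposes byte[] overloads on BinaryJedis, which I haven't verified:

import redis.clients.jedis.BinaryJedis;

public class BinaryJedisSketch {
    public static void main(String[] args) {
        // Assumption: this Jedis version offers byte[] overloads via
        // BinaryJedis, so no byte[] <-> String conversion happens in Java.
        BinaryJedis jedis = new BinaryJedis("localhost", 6379);
        byte[] key   = { 0x01, (byte) 0xFF, 0x7F };  // arbitrary binary key
        byte[] value = { (byte) 0xCA, (byte) 0xFE }; // arbitrary binary value
        jedis.set(key, value);
        byte[] back = jedis.get(key);                // same bytes, untouched
        System.out.println(back.length);
        jedis.disconnect();
    }
}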

Now I wonder: if I have to interconvert byte[] and String all the time, with data sizes ranging from very small to potentially very big, isn't it a huge waste of memory to have every 8-bit character passed around as 16 bits within Java?

asked Apr 12 '11 by ib84


2 Answers

Isn't there some trick to have a pseudo String that is 8 bit?

Yes, make sure you have an up-to-date version of Java. ;)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

-XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

EDIT: This option doesn't work in Java 6 update 22 and is not on by default in Java 6 update 24. Note: it appears this option may slow performance by about 10%.

The following program

import java.util.ArrayList;
import java.util.List;

public static void main(String... args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
        sb.append(i);

    // run the test repeatedly; results are only printed after two warm-up runs
    for (int j = 0; j < 10; j++)
        test(sb, j >= 2);
}

// builds 100 large strings and reports the heap bytes consumed per character
private static void test(StringBuilder sb, boolean print) {
    List<String> strings = new ArrayList<String>();
    forceGC();
    long free = Runtime.getRuntime().freeMemory();

    long size = 0;
    for (int i = 0; i < 100; i++) {
        final String s = "" + sb + i;
        strings.add(s);
        size += s.length();
    }
    forceGC();
    long used = free - Runtime.getRuntime().freeMemory();
    if (print)
        System.out.println("Bytes per character is " + (double) used / size);
}

// best-effort GC: collect twice with short pauses so freeMemory() readings settle
private static void forceGC() {
    try {
        System.gc();
        Thread.sleep(250);
        System.gc();
        Thread.sleep(250);
    } catch (InterruptedException e) {
        throw new AssertionError(e);
    }
}

Prints this by default

Bytes per character is 2.0013668655941212
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013606946433575
Bytes per character is 2.0013668655941212

with the option -XX:+UseCompressedStrings

Bytes per character is 1.0014671435440285
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014609725932648
Bytes per character is 1.0014671435440285
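For reference, the two sets of figures come from running the same program twice, once without and once with the option (assuming the snippet is compiled into a class named Main; pick your own name):

java Main
java -XX:+UseCompressedStrings Main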
answered Oct 01 '22 by Peter Lawrey


Actually, you have the UTF-8 part wrong: UTF-8 is a variable-length multibyte encoding, so there are valid characters 1 to 4 bytes in length (in other words, some UTF-8 characters take 8 bits, some 16, some 24, and some 32). Although the 1-byte characters take up only 8 bits, there are many more multibyte characters. If you restricted yourself to 1-byte characters, you could have at most 256 different characters in total (a.k.a. "Extended ASCII"); that may be sufficient for 90% of English-language use (my naïve guesstimate), but would bite you in the ass as soon as you even think of anything beyond that subset (see the word naïve: English, yet it can't be written with ASCII alone).
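You can observe the variable length directly; a quick sketch (the sample characters are my picks):

import java.nio.charset.Charset;

public class Utf8Lengths {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // one code point each, encoded lengths of 1 to 4 bytes in UTF-8:
        System.out.println("A".getBytes(utf8).length);            // 1 (U+0041, plain ASCII)
        System.out.println("\u00EF".getBytes(utf8).length);       // 2 (U+00EF, the i-diaeresis in naive)
        System.out.println("\u20AC".getBytes(utf8).length);       // 3 (U+20AC, the euro sign)
        System.out.println("\uD83D\uDE00".getBytes(utf8).length); // 4 (U+1F600; a surrogate pair, i.e. two chars, in Java)
    }
}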

So, although UTF-16 (which Java uses) looks wasteful, it actually isn't. Anyway, unless you're on a very limited embedded system (in which case, what are you doing there with Java?), trying to trim down the strings is pointless micro-optimization.

For a slightly longer introduction to character encodings, see e.g. this: http://www.joelonsoftware.com/articles/Unicode.html

answered Oct 01 '22 by Piskvor left the building