
How to avoid memory wastage when storing UTF-8 characters (8 bit) in Java chars (16 bit)? Two in one?

I'm afraid I have a question on a detail of a rather oversaturated topic. I searched around a lot, but couldn't find a clear answer to this specific, obvious and (imho) important problem:

When converting byte[] to String using UTF-8, each byte (8 bit) becomes an 8-bit character encoded in UTF-8, but each UTF-8 character is stored as a 16-bit char in Java. Is that correct? If yes, that means each stupid Java char only uses its first 8 bits and consumes double the memory? Is that correct too? I wonder how this wasteful behaviour is acceptable..

Isn't there some trick to have a pseudo-String that is 8 bit? Would that actually result in less memory consumption? Or maybe, is there a way to store *two* 8-bit characters in one 16-bit Java char to avoid this memory waste?
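Just so it's clear what I mean by "two in one", here is roughly the packing I imagine (my own sketch, not any real API; note that the resulting chars can be unpaired surrogates, i.e. not a valid String, and you have to remember the original byte count separately):

public class TwoInOne {
    // pack two 8-bit bytes into each 16-bit char ("two in one")
    static char[] pack(byte[] bytes) {
        char[] packed = new char[(bytes.length + 1) / 2];
        for (int i = 0; i < bytes.length; i++) {
            int shift = (i % 2 == 0) ? 8 : 0; // even index -> high byte, odd -> low byte
            packed[i / 2] |= (bytes[i] & 0xFF) << shift;
        }
        return packed;
    }

    // the original byte count must be passed in, since an odd count
    // still fills a whole (half-empty) char
    static byte[] unpack(char[] packed, int byteLength) {
        byte[] bytes = new byte[byteLength];
        for (int i = 0; i < byteLength; i++) {
            int shift = (i % 2 == 0) ? 8 : 0;
            bytes[i] = (byte) (packed[i / 2] >> shift);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] in = { 1, 2, 3 };
        byte[] out = unpack(pack(in), in.length);
        System.out.println(java.util.Arrays.equals(in, out)); // true
    }
}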

thanks for any deconfusing answers...

EDIT: hi, thanks everybody for answering. I was aware of the variable-length property of UTF-8. However, since my source is byte, which is 8 bit, I understood (apparently wrongly) that it would only need 8-bit UTF-8 words. Is the UTF-8 conversion actually producing the strange symbols that you see when you do "cat somebinary" on the CLI? I thought UTF-8 was just somehow used to map each of the possible 8-bit words of a byte to one particular 8-bit word of UTF-8. Wrong? I thought about using Base64, but it's bad because it uses only 7 bits..

Question reformulated: is there a smarter way to convert byte[] to something String-like? My favorite was to just cast byte[] to char[], but then I'd still have 16-bit words.
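Here is a little check I put together (just a sketch; the class name is made up). It shows that UTF-8 does not round-trip arbitrary bytes, while ISO-8859-1, which maps each byte value 1:1 to the char with the same value, does:

import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTripCheck {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] data = { 0x41, (byte) 0x80, (byte) 0xFF }; // 'A' plus two bytes that form no valid UTF-8 sequence

        // UTF-8 is not a 1:1 byte<->char mapping: the invalid bytes are
        // replaced with U+FFFD on decoding, so the round trip loses data.
        String viaUtf8 = new String(data, utf8);
        System.out.println(Arrays.equals(data, viaUtf8.getBytes(utf8))); // false

        // ISO-8859-1 maps byte n to char n, so the round trip is lossless
        // (each char still occupies 16 bits in memory, though).
        String viaLatin1 = new String(data, latin1);
        System.out.println(Arrays.equals(data, viaLatin1.getBytes(latin1))); // true
    }
}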

additional use case info:

I'm adapting Jedis (a Java client for the NoSQL store Redis) as the "primitive storage layer" for HyperGraphDB. So Jedis is a database for another "database". My problem is that I have to feed Jedis with byte[] data all the time, but internally Redis (the actual server) deals only with "binary safe" strings. Since Redis is written in C, a char is 8 bits long, AFAIK not ASCII which is 7 bits. In Jedis, however (Java world), every character is internally 16 bits long. I don't understand this code (yet), but I suppose Jedis then converts these 16-bit Java strings to Redis-conforming 8-bit strings ([here][3]; it says it extends FilterOutputStream). My hope is to bypass the byte[] <-> String conversion altogether and use that FilterOutputStream...?
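What I would like, ideally, is to skip strings entirely and talk to Redis in byte[] directly. Something like this sketch; I'm assuming here that the Jedis version at hand exposes byte[] overloads on BinaryJedis, which I haven't verified:

import redis.clients.jedis.BinaryJedis;

public class BinaryJedisSketch {
    public static void main(String[] args) {
        // Assumption: this Jedis version offers byte[] overloads via
        // BinaryJedis, so no byte[] <-> String conversion happens in Java.
        BinaryJedis jedis = new BinaryJedis("localhost", 6379);
        byte[] key   = { 0x01, (byte) 0xFF, 0x7F };  // arbitrary binary key
        byte[] value = { (byte) 0xCA, (byte) 0xFE }; // arbitrary binary value
        jedis.set(key, value);
        byte[] back = jedis.get(key);                // same bytes, untouched
        System.out.println(back.length);
        jedis.disconnect();
    }
}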

Now I wonder: if I have to interconvert byte[] and String all the time, with data sizes ranging from very small to potentially very big, isn't it a huge waste of memory to have every 8-bit character passed around as 16 bits within Java?

asked Apr 12 '11 by ib84


2 Answers

Isn't there some trick to have a pseudo String that is 8 bit?

Yes, make sure you have an up-to-date version of Java. ;)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

-XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

EDIT: This option doesn't work in Java 6 update 22 and is not on by default in Java 6 update 24. Note: it appears this option may slow performance by about 10%.

The following program

import java.util.ArrayList;
import java.util.List;

public static void main(String... args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
        sb.append(i);

    // run the test repeatedly; results are only printed after two warm-up runs
    for (int j = 0; j < 10; j++)
        test(sb, j >= 2);
}

// builds 100 large strings and reports the heap bytes consumed per character
private static void test(StringBuilder sb, boolean print) {
    List<String> strings = new ArrayList<String>();
    forceGC();
    long free = Runtime.getRuntime().freeMemory();

    long size = 0;
    for (int i = 0; i < 100; i++) {
        final String s = "" + sb + i;
        strings.add(s);
        size += s.length();
    }
    forceGC();
    long used = free - Runtime.getRuntime().freeMemory();
    if (print)
        System.out.println("Bytes per character is " + (double) used / size);
}

// best-effort GC: collect twice with short pauses so freeMemory() readings settle
private static void forceGC() {
    try {
        System.gc();
        Thread.sleep(250);
        System.gc();
        Thread.sleep(250);
    } catch (InterruptedException e) {
        throw new AssertionError(e);
    }
}

Prints this by default

Bytes per character is 2.0013668655941212
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013606946433575
Bytes per character is 2.0013668655941212

with the option -XX:+UseCompressedStrings

Bytes per character is 1.0014671435440285
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014609725932648
Bytes per character is 1.0014671435440285
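For reference, the two sets of figures come from running the same program twice, once without and once with the option (assuming the snippet is compiled into a class named Main; pick your own name):

java Main
java -XX:+UseCompressedStrings Main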
answered Oct 01 '22 by Peter Lawrey


Actually, you have the UTF-8 part wrong: UTF-8 is a variable-length multibyte encoding, so there are valid characters 1 to 4 bytes in length (in other words, some UTF-8 characters take 8 bits, some 16, some 24, and some 32). Although the 1-byte characters take up only 8 bits, there are many more multibyte characters. If you restricted yourself to 1-byte characters, you could have at most 256 different characters in total (a.k.a. "Extended ASCII"); that may be sufficient for 90% of English-language use (my naïve guesstimate), but would bite you in the ass as soon as you even think of anything beyond that subset (see the word naïve: English, yet it can't be written with ASCII alone).
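You can observe the variable length directly; a quick sketch (the sample characters are my picks):

import java.nio.charset.Charset;

public class Utf8Lengths {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // one code point each, encoded lengths of 1 to 4 bytes in UTF-8:
        System.out.println("A".getBytes(utf8).length);            // 1 (U+0041, plain ASCII)
        System.out.println("\u00EF".getBytes(utf8).length);       // 2 (U+00EF, the i-diaeresis in naive)
        System.out.println("\u20AC".getBytes(utf8).length);       // 3 (U+20AC, the euro sign)
        System.out.println("\uD83D\uDE00".getBytes(utf8).length); // 4 (U+1F600; a surrogate pair, i.e. two chars, in Java)
    }
}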

So, although UTF-16 (which Java uses) looks wasteful, it actually isn't. Anyway, unless you're on a very limited embedded system (in which case, what are you doing there with Java?), trying to trim down the strings is pointless micro-optimization.

For a slightly longer introduction to character encodings, see e.g. this: http://www.joelonsoftware.com/articles/Unicode.html

answered Oct 01 '22 by Piskvor left the building