 

Avoid creating 'new' String objects when converting a byte[] to String using a specific charset

I'm reading from a binary file and want to convert the bytes to US ASCII strings. Is there any way to do this without calling new on String to avoid multiple semantically equal String objects being created in the string literal pool? I'm thinking that it is probably not possible since introducing String objects using double quotes is not possible here. Is this correct?

private String nextString(DataInputStream dis, int size)
throws IOException
{
  byte[] bytesHolder = new byte[size];
  // readFully blocks until the buffer is full; read() may return after fewer bytes
  dis.readFully(bytesHolder);
  return new String(bytesHolder, Charset.forName("US-ASCII")).trim();
}
vahidg asked Oct 16 '09 14:10




3 Answers

You'd have to have a cache mapping byte arrays to strings, then search through the cache for any equal values before creating a new string.
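As a rough sketch of that idea: the class below keys the cache on the byte contents, so a hit returns the existing String without decoding a new one. The class name `ByteKeyedStringCache` and its methods are hypothetical, chosen only for illustration. It relies on `ByteBuffer.wrap`, whose `equals` and `hashCode` compare the wrapped bytes rather than the array identity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original answer.
final class ByteKeyedStringCache {
    private static final Charset ASCII = Charset.forName("US-ASCII");

    // ByteBuffer compares by content, so equal byte sequences map to the same entry.
    private final Map<ByteBuffer, String> cache = new HashMap<>();

    /** Returns a shared String for these bytes, decoding only on a cache miss. */
    String get(byte[] bytes) {
        String cached = cache.get(ByteBuffer.wrap(bytes));
        if (cached == null) {
            cached = new String(bytes, ASCII).trim();
            // Key on a private copy so later changes to the caller's array cannot corrupt the map.
            cache.put(ByteBuffer.wrap(bytes.clone()), cached);
        }
        return cached;
    }

    int size() {
        return cache.size();
    }
}
```

Each lookup still allocates a small ByteBuffer wrapper, and two different byte sequences that trim to the same text (for example "abc " and " abc") end up as two separate String instances, so this only deduplicates identical input bytes. The sketch is also not thread-safe.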

You can intern existing strings with intern() as Yishai posted - that won't stop you from creating more strings, but it'll make all but the first one (for any char sequence) very short lived. On the other hand, it'll make all the distinct strings live for a very long time indeed.

You can have "pseudo-interning" by using a Map<String, String>:

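// cache is assumed to be a Map<String, String> field shared across calls
String tmp = new String(bytesHolder, Charset.forName("US-ASCII")).trim();
String cached = cache.get(tmp);
if (cached == null)
{
    // First time this value is seen: it becomes the canonical instance
    cached = tmp;
    cache.put(tmp, tmp);
}
return cached;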

You could even put a bit more effort in and end up with an LRU cache so that it'll keep the N most recently fetched strings, discarding others when it needs to.
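One possible way to get such an LRU cache, as a minimal sketch, is a `LinkedHashMap` in access order with `removeEldestEntry` overridden. The class name `LruStringCache`, the `canonicalize` method and the `maxEntries` parameter are assumptions made for this example, not something the answer prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU "pseudo-interning" cache; names are illustrative only.
final class LruStringCache {
    private final Map<String, String> cache;

    LruStringCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each put; evict the least recently used entry when over capacity.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached instance equal to s, or caches s and returns it. */
    String canonicalize(String s) {
        String cached = cache.get(s); // get() counts as an access
        if (cached == null) {
            cache.put(s, s);
            cached = s;
        }
        return cached;
    }

    boolean contains(String s) {
        return cache.containsKey(s); // containsKey() does not affect the access order
    }
}
```

In the question's method this would be used as `return lru.canonicalize(tmp);` in place of the plain map lookup. Like `LinkedHashMap` itself, the sketch is not thread-safe.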

None of that reduces the number of strings created in the first place, as I say - but is that likely to be a problem in your situation? GCs have been tuned to make it very cheap to create short-lived objects.

Jon Skeet answered Sep 28 '22 10:09


You can call the intern() method on the string to ensure there is only one instance of each distinct value for the whole JVM.

String s = new String(bytes, "US-ASCII").intern();

You won't avoid creating the initial string each time, but you will save on storage, because the duplicates can be garbage-collected right away.
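To make the effect visible, here is a small sketch comparing object identity with and without intern(). The class `InternDemo` and its two helper methods are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; shows identity, not a recommended production helper.
final class InternDemo {
    private static final Charset ASCII = Charset.forName("US-ASCII");

    // Plain decoding: every call creates a distinct String object.
    static String decode(byte[] bytes) {
        return new String(bytes, ASCII);
    }

    // Decoding plus intern(): every call returns the JVM-wide canonical instance.
    static String decodeInterned(byte[] bytes) {
        return new String(bytes, ASCII).intern();
    }

    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(ASCII);
        System.out.println(decode(bytes) == decode(bytes));                 // distinct objects
        System.out.println(decodeInterned(bytes) == decodeInterned(bytes)); // same object
    }
}
```

The temporary String is still allocated inside `decodeInterned`; only the reference that survives is shared.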

That said, interned strings have limited storage space (on older JVMs they live in the permanent generation), so use intern() with caution. A better option may be a HashMap with the string as both key and value: check whether the string already exists, return the stored instance if it does, and insert it if it doesn't. That way you won't have such memory limitations.

Yishai answered Sep 28 '22 09:09


You shouldn't be concerned about it, unless you have profiled your application and determined that String creation is the exact source of your problem.

If you find out that the String creation is the source of your problem I would recommend what Jon Skeet proposed, i.e. a mapping from byte[] to String. That has about the same effect as interning your Strings while not hogging up valuable memory until you restart the VM.

Bombe answered Sep 28 '22 10:09