
Split a Java String into chunks of 1024 bytes

What's an efficient way of splitting a String into chunks of 1024 bytes in Java? If there is more than one chunk, then the header (a fixed-size string) needs to be repeated in all subsequent chunks.

asked Feb 06 '09 by user54729


1 Answer

You have two options: the fast way and the memory-conservative way. But first, you need to know which characters are in the String. ASCII only? Are there umlauts (characters between 128 and 255) or even Unicode (s.charAt(i) returns something >= 256)? Depending on that, you will need a different encoding. If you have binary data, try "iso-8859-1" because it preserves every byte of the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:

String encoding = "iso-8859-1";

The fastest way:

ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));

Note that Java stores Strings as UTF-16 internally, so every character occupies two bytes in memory. You will have to specify the encoding explicitly (don't rely on the "platform default"; that will only cause pain later).

Now you can read it in 1024 chunks using

byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }

This needs about three times as much RAM as the original String.
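
Put together, here is a minimal, untested sketch of the fast approach which also repeats a (hypothetical) fixed-size header in front of every chunk, as the question asks. If the header is supposed to count towards the 1024-byte limit, size the read buffer as 1024 minus the header length instead:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {

    // Splits body into chunks of at most 1024 payload bytes and prepends
    // the header bytes to every chunk.
    public static List<byte[]> split(String header, String body, String encoding)
            throws IOException {
        byte[] headerBytes = header.getBytes(encoding);
        ByteArrayInputStream in = new ByteArrayInputStream(body.getBytes(encoding));

        List<byte[]> chunks = new ArrayList<byte[]>();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) > 0) {
            byte[] chunk = new byte[headerBytes.length + len];
            System.arraycopy(headerBytes, 0, chunk, 0, headerBytes.length);
            System.arraycopy(buffer, 0, chunk, headerBytes.length, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}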

A more memory-conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy characters from the reader to the writer until the underlying buffer contains at least one chunk of data:

When it does, copy the first 1024 bytes to the real output (prepending the header), copy any leftover bytes (the char-to-byte conversion may produce more than one byte per character) to a temp buffer, call buffer.reset(), and write the leftover bytes from the temp buffer back into the buffer.

Code looks like this (untested):

StringReader r = new StringReader (string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream (1024*2); // Twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter  (buffer, encoding);

char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush();
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        ... ready to process one chunk ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
... check if some data is left in buffer and process that, too ...

This only needs a couple of kilobytes of RAM.

[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:

String safe = new String (array, "iso-8859-1");

In Java, ISO-8859-1 (a.k.a. ISO-Latin-1) is a 1:1 mapping. This means the bytes in the array will not be interpreted or altered in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regexps on it, etc. For example, find the position of a 0-byte:

int pos = safe.indexOf('\u0000');

This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.

To write the data somewhere, the reverse operation is:

byte[] data = safe.getBytes("iso-8859-1");
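
As a quick sanity check, something like this verifies that every byte value from 0 to 255 survives the round trip when ISO-8859-1 is used in both directions:

byte[] original = new byte[256];
for (int i = 0; i < 256; i++) {
    original[i] = (byte) i;
}
String safe = new String(original, "iso-8859-1");
byte[] roundTripped = safe.getBytes("iso-8859-1");
System.out.println(java.util.Arrays.equals(original, roundTripped)); // prints "true"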

Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.
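
On Java 7 and later, one convenient way to follow that advice (and avoid the checked UnsupportedEncodingException) is to pass a Charset constant from StandardCharsets instead of a charset name:

import java.nio.charset.StandardCharsets;

byte[] data = safe.getBytes(StandardCharsets.ISO_8859_1);    // never guesses the platform default
String back = new String(data, StandardCharsets.ISO_8859_1); // and never throws UnsupportedEncodingException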

Now, the problem of characters > 255 in the String: if you use this method, you won't ever have such a character in your Strings. That said, if one did end up there for some reason, getBytes("iso-8859-1") would not throw an exception; it silently replaces every character that ISO-Latin-1 cannot express with '?'. If you need the conversion to fail loudly instead of silently, use a CharsetEncoder configured to report errors.
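
For illustration, a small sketch of such a strict conversion with java.nio.charset:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

CharsetEncoder encoder = Charset.forName("iso-8859-1")
        .newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

// The Euro sign (U+20AC) has no ISO-Latin-1 mapping, so this call throws an
// UnmappableCharacterException instead of silently producing a '?':
ByteBuffer bytes = encoder.encode(CharBuffer.wrap("abc\u20AC"));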

Some might argue that this is not safe enough and that you should never mix bytes and Strings. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute in the same way they have access permissions or a name). XML is one of the few formats with explicit encoding information, and there are editors like Emacs or jEdit which use special comments to specify this vital information. This means that, when processing streams of bytes, you must always know which encoding they are in. As of now, it's not possible to write code which will always work, no matter where the data comes from.

Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.
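
A rough sketch of what that byte-level peek can look like (the file path and the number of bytes read are arbitrary; real XML detection per the spec covers more cases, such as UTF-32):

import java.io.FileInputStream;
import java.io.IOException;

public class XmlEncodingSniffer {

    // Reads the first few bytes of an XML file, checks for a byte order mark,
    // and otherwise looks for encoding="..." in the (ASCII-compatible) XML declaration.
    public static String sniff(String path) throws IOException {
        byte[] head = new byte[256];
        int len;
        FileInputStream in = new FileInputStream(path);
        try {
            len = in.read(head);
        } finally {
            in.close();
        }
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";     // UTF-8 BOM
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";  // big-endian BOM
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";  // little-endian BOM
        }
        // Decode the prolog with ISO-8859-1 (the 1:1 mapping from above) and
        // look for the declared encoding.
        String prolog = new String(head, 0, Math.max(len, 0), "iso-8859-1");
        int start = prolog.indexOf("encoding=\"");
        if (start >= 0) {
            int end = prolog.indexOf('"', start + "encoding=\"".length());
            if (end > start) {
                return prolog.substring(start + "encoding=\"".length(), end);
            }
        }
        return "UTF-8"; // the XML default when there is no BOM and no declaration
    }
}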

The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good; if you don't, you're doomed. The confusion stems from the fact that most people are not aware that the same byte can mean different things depending on the encoding, or even that there is more than one encoding. It also would have helped if Sun hadn't introduced the notion of the "platform default encoding."

Important points for beginners:

  • There is more than one encoding (charset).
  • There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
  • You must know which encoding was used to generate the data which you are processing.
  • You must know which encoding you should use to write the data you are processing.
  • You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).

The days of ASCII are over.

answered Sep 28 '22 by Aaron Digulla