I want to split a String
to a String[]
array, whose elements meet following conditions.
s.getBytes(encoding).length
should not exceed maxsize(int)
.
If I join the splitted strings with StringBuilder
or +
operator, the result should be exactly the original string.
The input string may have unicode characters which can have multiple bytes when encoded in e.g. UTF-8.
The desired prototype is shown below.
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize)
And the testing code:
public boolean isNice(String str, String encoding, int max)
{
//boolean success=true;
StringBuilder b=new StringBuilder();
String[] splitted= SplitStringByByteLength(str,encoding,max);
for(String s: splitted)
{
if(s.getBytes(encoding).length>max)
return false;
b.append(s);
}
if(str.compareTo(b.toString()!=0)
return false;
return true;
}
Though it seems easy when the input string has only ASCII characters, the fact that it could cobtain multibyte characters makes me confused.
Thank you in advance.
Edit: I added my code impementation. (Inefficient)
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) throws UnsupportedEncodingException
{
ArrayList<String> splitted=new ArrayList<String>();
StringBuilder builder=new StringBuilder();
//int l=0;
int i=0;
while(true)
{
String tmp=builder.toString();
char c=src.charAt(i);
if(c=='\0')
break;
builder.append(c);
if(builder.toString().getBytes(encoding).length>maxsize)
{
splitted.add(new String(tmp));
builder=new StringBuilder();
}
++i;
}
return splitted.toArray(new String[splitted.size()]);
}
Is this the only way to solve this problem?
The class CharsetEncode
has provision for your requirement. Extract from the Javadoc of the Encode
method:
public final CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...
In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:
...
CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
A possible code could be:
public static String[] SplitStringByByteLength(String src,String encoding, int maxsize) {
Charset cs = Charset.forName(encoding);
CharsetEncoder coder = cs.newEncoder();
ByteBuffer out = ByteBuffer.allocate(maxsize); // output buffer of required size
CharBuffer in = CharBuffer.wrap(src);
List<String> ss = new ArrayList<>(); // a list to store the chunks
int pos = 0;
while(true) {
CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
int newpos = src.length() - in.length();
String s = src.substring(pos, newpos);
ss.add(s); // add what has been encoded to the list
pos = newpos; // store new input position
out.rewind(); // and rewind output buffer
if (! cr.isOverflow()) {
break; // everything has been encoded
}
}
return ss.toArray(new String[0]);
}
This will split the original string in chunks that when encoded in bytes fit as much as possible in byte arrays of the given size (assuming of course that maxsize is not ridiculously small).
The problem lies in the existence of Unicode "supplementary characters" (see Javadoc of the Character class), that take up two "character places" (a surrogate pair) in a String, and you shouldn't split your String in the middle of such a pair.
An easy approach to splitting would be to stick to the worst-case that a single Unicode code point can take at most four bytes in UTF-8, and split the string after every 99 code points (using string.offsetByCodePoints(pos, 99)
). In most cases, you won't fill the 400 bytes, but you'll be on the safe side.
Some words about code points and characters
When Java started, Unicode had less than 65536 characters, so Java decided that 16 bits were enough for a character. Later the Unicode standard exceeded the 16-bit limit, and Java had a problem: a single Unicode element (now called a "code point") no longer fit into a single Java character.
They decided to go for an encoding into 16-bit entities, being 1:1 for most usual code points, and occupying two "characters" for the exotic code points beyond the 16-bit limit (the pair built from so-called "surrogate characters" from a spare code range below 65535). So now it can happen that e.g. string.charAt(5)
and string.charAt(6)
must be seen in combination, as a "surrogate pair", together encoding one Unicode code point.
That's the reason why you shouldn't split a string at an arbitrary index.
To help the application programmer, the String class then got a new set of methods, working in code point units, and e.g. string.offsetByCodePoints(pos, 99)
means: from the index pos
, advance by 99 code points forward, giving an index that will often be pos+99
(in case the string doesn't contain anything exotic), but might be up to pos+198
, if all the following string elements happen to be surrogate pairs.
Using the code-point methods, you are safe not to land in the middle of a surrogate pair.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With