
Java substring broken encoding

I read some data from a stream in UTF-8 encoding:

String line = new String(byteArray, "UTF-8");

then try to find a substring:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it

String name = line.substring(startPos, endPos);

In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc. It seems like surrogate pairs are randomly broken for some reason. This happens about 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to my result?

n00bot asked Oct 11 '13 11:10



1 Answer

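The problem occurs because the stream is read in chunks of bytes, and a chunk boundary sometimes falls in the middle of a multi-byte UTF-8 character. When each chunk is decoded on its own, the two halves of that character cannot be decoded and are each replaced with "�". indexOf() and substring() are not the cause, and neither are surrogate pairs: the string is already damaged before you search it.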
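To see the effect in isolation, here is a minimal sketch that decodes the same bytes in two pieces. The class and method names (SplitChunkDemo, decodePerChunk) are made up for illustration, and the split position stands in for wherever your read buffer happened to end; your actual reading code is not shown in the question, so this is only an assumption about what it does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitChunkDemo {
    // Hypothetical helper: decodes two byte chunks independently and
    // concatenates the results, the way per-chunk decoding would.
    static String decodePerChunk(byte[] data, int splitAt) {
        String first = new String(Arrays.copyOfRange(data, 0, splitAt), StandardCharsets.UTF_8);
        String second = new String(Arrays.copyOfRange(data, splitAt, data.length), StandardCharsets.UTF_8);
        return first + second;
    }

    public static void main(String[] args) {
        // The name from the question, written with escapes so the demo does
        // not depend on the source file encoding. 8 letters, 2 bytes each.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // Split on a character boundary: the text survives.
        System.out.println(decodePerChunk(data, 8).equals(name));
        // Split inside the fifth letter: each half becomes U+FFFD.
        System.out.println(decodePerChunk(data, 9));
    }
}
```

With the split at byte 9, the fifth letter is cut in half and the result contains two replacement characters in its place, which matches the first broken value in the question.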

By wrapping the InputStream in an InputStreamReader created with the UTF-8 charset, you read chunks of characters (as opposed to chunks of bytes). The reader keeps any incomplete byte sequence until the rest of it arrives, so multi-byte UTF-8 characters survive.
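A minimal sketch of that approach, assuming the data arrives through an ordinary InputStream. ReaderDemo, ThreeByteStream and decodeInChunks are hypothetical names; ThreeByteStream exists only to simulate a source that hands out bytes in awkward 3-byte pieces, so that some reads end in the middle of a character.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Simulated source (illustration only): never returns more than
    // 3 bytes per bulk read, so 2-byte characters get split across reads.
    static class ThreeByteStream extends FilterInputStream {
        ThreeByteStream(InputStream in) { super(in); }
        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    // Reads characters, not bytes: the InputStreamReader buffers an
    // incomplete sequence until its remaining bytes are available.
    static String decodeInChunks(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ThreeByteStream(new ByteArrayInputStream(data)), StandardCharsets.UTF_8)) {
            char[] buf = new char[4];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        String decoded = decodeInChunks(name.getBytes(StandardCharsets.UTF_8));
        System.out.println(decoded.equals(name));
    }
}
```

Once the text is read this way, the indexOf() + substring() code from the question can stay as it is. If you need whole lines, wrapping the InputStreamReader in a BufferedReader and calling readLine() is a common choice.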

andrel answered Sep 29 '22 22:09
