I read some data from a stream and decode it as UTF-8:
String line = new String(byteArray, "UTF-8");
then try to find a substring in it
int startPos = line.indexOf(tag) + tag.length();
int endPos = line.indexOf("/", startPos);
and cut it
String name = line.substring(startPos, endPos);
In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc.
It looks as if surrogate pairs are randomly broken for some reason. It happened 4 times out of 1000.
How can I fix this? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?
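The problem occurs because the stream is read in chunks of bytes, and a chunk boundary sometimes falls in the middle of a multi-byte UTF-8 character. These are not surrogate pairs: each Cyrillic letter is a single char in Java but takes two bytes in UTF-8, and when those two bytes end up in different chunks that are decoded separately, each half is replaced with U+FFFD (�).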
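To make the failure concrete, here is a minimal sketch that decodes a byte array chunk by chunk. The class name SplitUtf8Demo, the method decodePerChunk, and the chunk size of 3 are assumptions chosen for illustration; the question does not show how the stream was actually read.

```java
import java.nio.charset.StandardCharsets;

public class SplitUtf8Demo {

    // Decodes every fixed-size byte chunk on its own. A 2-byte UTF-8
    // character that straddles a chunk boundary is decoded as two
    // malformed fragments, each replaced with U+FFFD.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "гордунни" written with escapes so the source file encoding does not matter.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8); // 16 bytes, 2 per letter

        // An odd chunk size guarantees that some letters are cut in half.
        String decoded = decodePerChunk(data, 3);
        System.out.println(decoded.contains("\uFFFD")); // true
        System.out.println(decoded.equals(name));       // false
    }
}
```

With real network or file reads the chunk sizes vary, which would explain why the damage shows up only occasionally and at different positions.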
If you wrap the InputStream in an InputStreamReader, you read chunks of characters instead of chunks of bytes. The reader keeps an incomplete byte sequence until the rest of it arrives, so multi-byte UTF-8 characters survive intact.
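A minimal sketch of that approach follows. The names ReaderDecodeDemo, readAll, and TrickleInputStream are hypothetical; TrickleInputStream exists only to simulate a source that delivers at most 3 bytes per read, so that characters are certain to be split across reads.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderDecodeDemo {

    // Hypothetical helper for the demo: returns at most 3 bytes per read
    // call, so 2-byte characters regularly straddle two reads.
    static final class TrickleInputStream extends FilterInputStream {
        TrickleInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    // Reads the whole stream as UTF-8 text. The InputStreamReader does the
    // decoding and carries partial byte sequences over to the next read.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "гордунни" written with escapes so the source file encoding does not matter.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        InputStream in = new TrickleInputStream(
                new ByteArrayInputStream(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println(readAll(in).equals(name)); // true
    }
}
```

Once the text has been decoded this way, the indexOf() and substring() calls from the question can stay as they are, since they operate on complete characters.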