Java substring broken encoding

Question

I read some data from a stream and decode it as UTF-8:

String line = new String(byteArray, "UTF-8");

then I try to find a subsequence:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it out:

String name = line.substring(startPos, endPos);

In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc. It seems as if surrogate pairs are randomly broken for some reason. This happened 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?

andrel · Accepted Answer

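The problem occurs because the stream is read as chunks of bytes, and a chunk boundary sometimes falls in the middle of a multi-byte UTF-8 character. When each chunk is then decoded on its own, the two halves of that character are invalid and come out as replacement characters.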
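To make the cause concrete, here is a minimal sketch that reproduces the symptom. It assumes the reading code decodes each byte buffer separately with new String(...), which the question does not actually show; the class and method names (SplitChunkDemo, decodeChunksSeparately) are made up for illustration. The name is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitChunkDemo {
    // Hypothetical helper: decodes two byte chunks independently and joins the
    // results, mimicking a read loop that calls new String(...) per buffer.
    public static String decodeChunksSeparately(byte[] data, int splitAt) {
        String head = new String(Arrays.copyOfRange(data, 0, splitAt), StandardCharsets.UTF_8);
        String tail = new String(Arrays.copyOfRange(data, splitAt, data.length), StandardCharsets.UTF_8);
        return head + tail;
    }

    public static void main(String[] args) {
        // "гордунни": 8 Cyrillic letters, each encoded as 2 bytes in UTF-8.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // Split on a character boundary: the text survives.
        System.out.println(decodeChunksSeparately(data, 8).equals(name)); // true

        // Split inside the fifth letter: each half decodes to U+FFFD.
        System.out.println(decodeChunksSeparately(data, 9).equals(name)); // false
    }
}
```

This also explains why indexOf() and substring() are not at fault: the string is already damaged before they run.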

If you wrap the InputStream in an InputStreamReader created with the UTF-8 charset, you read chunks of characters (as opposed to chunks of bytes), so multi-byte UTF-8 characters survive intact even when their bytes arrive in separate reads.
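A minimal sketch of that approach follows. The names ReaderDemo, readAll and choppy are hypothetical, and choppy exists only to simulate a stream that hands out a few bytes at a time (for example, a socket); your real InputStream would take its place.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Reads the whole stream as text. The InputStreamReader keeps any
    // incomplete byte sequence until the remaining bytes arrive.
    public static String readAll(InputStream in, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[bufferSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Test-only helper: a stream that returns at most maxChunk bytes per read,
    // so 2-byte characters regularly straddle two reads.
    public static InputStream choppy(byte[] data, int maxChunk) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, maxChunk));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438"; // "гордунни"
        byte[] data = name.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(choppy(data, 3), 4);
        System.out.println(decoded.equals(name));
    }
}
```

Once the text is decoded this way, the original indexOf() + substring() code can stay as it is. If the input is line-oriented, wrapping the reader in a BufferedReader and calling readLine() is a common alternative.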

Tags: java, substring, utf-8

n00bot