I need to convert a stream of bytes to a line of UTF-8 characters. The only character that matters to me in that line is the last one. The conversion happens in a loop, so performance is very important. A simple but inefficient approach would be:
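import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public class Foo {
    // Line feed (0x0A) is the character being waited for.
    private static final char THE_CHAR_WE_ARE_WAITING_FOR = '\n';
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void next(byte input) throws UnsupportedEncodingException {
        this.buffer.write(input);
        String text = this.buffer.toString("UTF-8"); // this is time consuming
        if (text.charAt(text.length() - 1) == THE_CHAR_WE_ARE_WAITING_FOR) {
            System.out.println("hurray!");
            this.buffer.reset();
        }
    }
}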
The whole byte array is converted to a string on every input byte, which, as I understand it, is very inefficient. Is there another way to do this that preserves the result of the bytes-to-text conversion from the previous iteration?
You can use a simple class that accumulates bytes and only converts once a complete UTF-8 sequence has arrived. Here is a sample (with no error checking, which you may want to add):
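import java.io.UnsupportedEncodingException;

class UTF8Processor {
    // Holds the bytes of the sequence currently being assembled.
    private byte[] buffer = new byte[6];
    private int count = 0;

    // Returns the decoded character once its final byte arrives, otherwise null.
    public String processByte(byte nextByte) throws UnsupportedEncodingException {
        buffer[count++] = nextByte;
        if (count == expectedBytes()) {
            String result = new String(buffer, 0, count, "UTF-8");
            count = 0;
            return result;
        }
        return null;
    }

    // Sequence length implied by the lead byte. No validation is done: a stray
    // continuation byte (0x80-0xBF) in lead position is treated as a 2-byte lead.
    private int expectedBytes() {
        int num = buffer[0] & 255;
        if (num < 0x80) return 1;
        if (num < 0xe0) return 2;
        if (num < 0xf0) return 3;
        if (num < 0xf8) return 4;
        return 5;
    }
}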
class Bop {
    public static void main(String[] args) throws java.lang.Exception {
        // Create test data.
        String str = "Hejsan åäö/漢ya";
        byte[] bytes = str.getBytes("UTF-8");
        String ch;

        // Process byte by byte; processByte returns a decoded UTF-8 character
        // whenever a complete one is available.
        UTF8Processor processor = new UTF8Processor();
        for (int i = 0; i < bytes.length; i++) {
            if ((ch = processor.processByte(bytes[i])) != null)
                System.out.println(ch);
        }
    }
}
Based on the comment "It's line feed (0x0A)", your next method can just check:
if ((char)input == THE_CHAR_WE_ARE_WAITING_FOR) {
    // whatever your logic is.
}
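You don't have to do any conversion for characters < 128: in UTF-8, a byte value below 0x80 always encodes a single ASCII character and never appears inside a multi-byte sequence.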
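Putting that together, here is a minimal sketch of what the whole class could look like, assuming the character you are waiting for really is line feed. The class name LineDetector, the constant LINE_FEED, and the choice to return the finished line (rather than print "hurray!") are my own assumptions for illustration, not something from the question. Bytes are only buffered while the line is being read, and the UTF-8 decoding happens once per line instead of once per byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: buffers raw bytes and decodes only when a line ends.
class LineDetector {
    // Assumption: the terminator is line feed (0x0A), as stated in the comment.
    private static final byte LINE_FEED = 0x0A;

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Returns the completed line (without the line feed) when the terminator
    // arrives, otherwise null. The terminator is compared as a raw byte,
    // so no decoding is needed to detect it.
    String next(byte input) {
        if (input == LINE_FEED) {
            String line = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            buffer.reset();
            return line;
        }
        buffer.write(input);
        return null;
    }
}
```

If you never need the text of the line at all, you can drop the buffer entirely and keep only the byte comparison.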