
How to convert stream of bytes to UTF-8 characters?

I need to convert a stream of bytes to a line of UTF-8 characters. The only character that matters to me in that line is the last one. The conversion happens in a loop, so performance is very important. A simple but inefficient approach would be:

import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

public class Foo {
  private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  void next(byte input) throws UnsupportedEncodingException {
    this.buffer.write(input);
    String text = this.buffer.toString("UTF-8"); // this is time consuming
    if (text.charAt(text.length() - 1) == THE_CHAR_WE_ARE_WAITING_FOR) {
      System.out.println("hurray!");
      this.buffer.reset();
    }
  }
}

The byte array is converted to a string on every input byte, which, as I understand it, is very inefficient. Is there another way to do this that preserves the result of the bytes-to-text conversion from the previous iteration?

yegor256 asked Jun 23 '13 06:06


2 Answers

You can use a simple class to keep track of the bytes, and only convert once you have a complete UTF-8 sequence. Here's a sample (with no error checking, which you may want to add):

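import java.nio.charset.StandardCharsets;

class UTF8Processor {
    // Holds the bytes of the sequence currently being assembled.
    private final byte[] buffer = new byte[6];
    private int count = 0;

    // Returns the decoded character once its last byte has arrived, otherwise null.
    public String processByte(byte nextByte) {
        buffer[count++] = nextByte;
        if(count == expectedBytes())
        {
            String result = new String(buffer, 0, count, StandardCharsets.UTF_8);
            count = 0;
            return result;
        }
        return null;
    }

    // Derives the sequence length from the lead byte. There is no validation:
    // a stray continuation byte (0x80-0xBF) in lead position is treated as
    // the start of a 2-byte sequence.
    private int expectedBytes() {
        int num = buffer[0] & 255;
        if(num < 0x80) return 1;
        if(num < 0xe0) return 2;
        if(num < 0xf0) return 3;
        if(num < 0xf8) return 4;
        return 5; // legacy form; valid UTF-8 sequences are at most 4 bytes (RFC 3629)
    }
}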

class Bop
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // Create test data.
        String str = "Hejsan åäö/漢ya";
        byte[] bytes = str.getBytes("UTF-8");

        String ch;

        // Process byte by byte; processByte returns a valid UTF-8 character
        // as soon as a complete one is available.

        UTF8Processor processor = new UTF8Processor();

        for(int i=0; i<bytes.length; i++)
        {
            if((ch = processor.processByte(bytes[i])) != null)
                System.out.println(ch);
        }
    }
}
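
A different technique from the hand-rolled length tracking above is to let java.nio.charset.CharsetDecoder do the work: when called with endOfInput set to false, it leaves the bytes of an incomplete sequence in the input buffer, so they can be carried over to the next call. The following is only a sketch under that assumption. The class name IncrementalUtf8 and the method feed are hypothetical, and malformed input is replaced with U+FFFD instead of being reported.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the answer above.
class IncrementalUtf8 {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Pending bytes of an incomplete sequence plus the newest byte.
    private final ByteBuffer in = ByteBuffer.allocate(8);
    private final CharBuffer out = CharBuffer.allocate(8);

    // Returns the text completed by this byte, or null if the
    // current sequence is still incomplete.
    String feed(byte b) {
        in.put(b);
        in.flip();
        decoder.decode(in, out, false); // false: more input may follow
        in.compact();                   // keep undecoded bytes for the next call
        out.flip();
        String decoded = out.toString();
        out.clear();
        return decoded.isEmpty() ? null : decoded;
    }

    public static void main(String[] args) {
        IncrementalUtf8 utf8 = new IncrementalUtf8();
        for (byte b : "Hejsan åäö/漢ya".getBytes(StandardCharsets.UTF_8)) {
            String ch = utf8.feed(b);
            if (ch != null) {
                System.out.println(ch);
            }
        }
    }
}
```

The trade-off is a few more buffer operations per byte in exchange for leaving sequence validation to the JDK decoder.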

Joachim Isaksson answered Sep 22 '22 16:09


Based on the comment:

It's line feed (0x0A)

Your next method can just check:

if ((char)input == THE_CHAR_WE_ARE_WAITING_FOR) {
    //whatever your logic is.
}

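You don't have to do any conversion for characters < 128. In UTF-8 they are encoded as a single byte identical to ASCII, and every byte of a multi-byte sequence has its high bit set, so 0x0A can never appear inside another character.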
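
Put together, the byte-level check might look like the sketch below. It assumes the character being waited for is the line feed mentioned in the comment. The class name LineFeedWatcher and the choice to return the finished line without its terminator are illustrative assumptions, not something stated in the question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration.
class LineFeedWatcher {
    private static final byte LINE_FEED = 0x0A;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    // Returns the completed line (without the line feed) when 0x0A
    // arrives, otherwise null. Decoding happens once per line.
    String next(byte input) {
        if (input == LINE_FEED) {
            String line = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            buffer.reset();
            return line;
        }
        buffer.write(input);
        return null;
    }
}
```

Here the comparison is done on the raw byte, and the buffer is converted to a String only when a full line has been collected.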

Aurand answered Sep 26 '22 16:09