Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read a Unicode G-Clef (U+1D11E) from a file?

Tags:

java

unicode

The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bit. Almost all of Java's read functions return only a char or a int containing also only 16 bit. Which function reads complete Unicode symbols including SMP, SIP, TIP, SSP and PUA?

Update

I have asked how to read a single Unicode symbol (or code point) from a input stream. I neither have any integer array nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint() but this function requires a char. On the other side reading a char is not possible because read() returns an int. My best work around so far is this but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How to do it better?

Update 2

Another version returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

The remaining question: are the casts safe or how to avoid them?

like image 306
ceving Avatar asked Jun 28 '13 09:06

ceving


1 Answers

My best work around so far is this but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first then you can guarantee that the other (char) casts are safe as Reader.read() is specified to return either -1 or a value that is within the range of char (0 - 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
      return ch16; // or possibly throw an exception
    else 
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}

This still isn't ideal, really you need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate, in which case you probably want to return the first char as-is and backup the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that then how about

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use unread(secondChar)

like image 198
Ian Roberts Avatar answered Oct 07 '22 15:10

Ian Roberts