
How to determine if a Unicode character is valid

I would like an algorithm or library that can indicate whether a Unicode point is valid. For example U+F8F8 appears not to be a valid Unicode character but is described as "PRIVATE_USE_AREA". I have found ICU - is this a good/best solution?

UPDATE: @Reprogrammer's suggestion (below) is to use:

CoderResult call(CharsetDecoderICU decoder, Object context, 
     ByteBuffer source, CharBuffer target, IntBuffer offsets, 
     char[] buffer, int length, CoderResult cr)
This function is called when the bytes in the source cannot be handled, 
    and this function is meant to handle or fix the error if possible.

Thanks. This looks more complex than I was hoping for - maybe it is necessarily a more complex problem than I thought. (The problem includes points such as '<Non Private Use High Surrogate, First>' (U+D800), which are, I assume, only valid if followed by at least one more code point.)

UPDATE: @Jukka writes:

Define “valid”. A Private Use code point is valid as per the Unicode Standard, it just does not have any character assigned to it in the standard. A surrogate code point is not valid character data, but surrogate code units can be used in UTF-16. A Java string is a sequence of code units, not characters; any code unit may appear there, but when you process a string as characters, it should comply with Unicode requirements on characters. – Jukka K. Korpela

I agree that defining "valid" is important. I took the usage from the FileFormat.Info site which declared:

 U+F8F8 is not a valid unicode character.

It seems a fairly authoritative site, so I used their term. Maybe they are somewhat imprecise.

UPDATE: I have tried to translate @Ignacio's Python into Java but failed. I wrote

import java.util.regex.Pattern;

public void testUnicode() {
    // \p{Cn} matches code points whose General Category is "not assigned"
    Pattern pattern = Pattern.compile("\\p{Cn}");
    System.out.println("\\u0020 " + pattern.matcher("\u0020").matches());
    System.out.println("A " + pattern.matcher("A").matches());
    System.out.println("\\uf8f8 " + pattern.matcher("\uf8f8").matches());
}

which uniformly returned false, even for the "valid" Unicode characters. I also couldn't find \p{Cn} documented.

peter.murray.rust asked Dec 10 '12 04:12


2 Answers

The approach that you describe in a comment to the answer by @IgnacioVazquez-Abrams is a correct one: matching against patterns like "\\p{Cn}", which test for the General Category (gc) property. But for U+F8F8, this specific match correctly yields false, because this character’s category is not Cn but Co (Other, private use). The false results for U+0020 and "A" are correct too, since both are assigned characters. If you test e.g. for U+FFFF, you get true.

The Unicode categories in major class C (with category name starting with C) are:

  • Cc: Other, control; control characters, e.g. Carriage Return
  • Cf: Other, format; e.g. the soft hyphen (invisible, but may affect formatting)
  • Cs: Other, surrogates; not valid in character data, but may appear, in pairs, in a Java string (which is a string of code units, not characters)
  • Co: Other, private use; valid in character data, but has no character assigned to it by the Unicode standard, and should not be used in information interchange except by private assignments (that assign some meaning to the code point)
  • Cn: Other, not assigned; this may mean that the code point is permanently designated as a noncharacter, or just unassigned, e.g. not assigned yet (but may be assigned to a character in a future version of Unicode)
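As a rough illustration of the list above, the following sketch uses the standard Character.getType method to report which class C category a code point falls in. The class name CategoryDemo, the helper majorClassC and the sample code points are assumptions made for this example only; the results also depend on the Unicode version supported by the Java runtime in use.

```java
final class CategoryDemo {
    // Maps the constant returned by Character.getType to the two-letter
    // General Category abbreviation for the categories in major class C.
    static String majorClassC(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:     return "Cc";
            case Character.FORMAT:      return "Cf";
            case Character.SURROGATE:   return "Cs";
            case Character.PRIVATE_USE: return "Co";
            case Character.UNASSIGNED:  return "Cn";
            default:                    return "not in class C";
        }
    }

    public static void main(String[] args) {
        // Carriage Return, soft hyphen, a high surrogate, a Private Use
        // code point, a noncharacter, and the letter A for comparison.
        int[] samples = {0x000D, 0x00AD, 0xD800, 0xE000, 0xFFFF, 0x0041};
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, majorClassC(cp));
        }
    }
}
```

Working with int code points rather than char values means the same helper can also be applied to code points beyond the Basic Multilingual Plane.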

So when testing for validity, Cn should be rejected (with the reservation that this may cause a rejection of a valid character when the Unicode standard is changed). Cs should be rejected when testing code points, but when processing Java strings, you should accept a pair of Cs code units when the first one is a high surrogate and the second one is a low surrogate (assuming that you wish to accept characters beyond the Basic Multilingual Plane). The handling of Co depends on whether you wish to treat Private Use code points as valid.
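The rules in the preceding paragraph could be sketched in Java roughly as follows. This is only one possible reading of "valid": the class CodePointValidator, its method names and the allowPrivateUse flag are hypothetical, introduced here to make the Co policy explicit. The Cn check reflects whatever Unicode version the running JDK implements.

```java
final class CodePointValidator {
    // Decides validity of a single code point. allowPrivateUse is a
    // hypothetical policy switch for category Co.
    static boolean isValidCodePoint(int cp, boolean allowPrivateUse) {
        if (!Character.isValidCodePoint(cp)) {
            return false; // outside the range U+0000..U+10FFFF
        }
        switch (Character.getType(cp)) {
            case Character.UNASSIGNED:  return false;           // Cn
            case Character.SURROGATE:   return false;           // Cs as a code point
            case Character.PRIVATE_USE: return allowPrivateUse; // Co
            default:                    return true;
        }
    }

    // Walks a Java string by code point. codePointAt combines a high
    // surrogate followed by a low surrogate into one supplementary code
    // point; an unpaired surrogate comes back as itself and is rejected.
    static boolean isValidString(String s, boolean allowPrivateUse) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!isValidCodePoint(cp, allowPrivateUse)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

With this sketch, a string containing a well-formed surrogate pair passes, a string containing a lone or reversed surrogate fails, and a Private Use code point passes only when the caller opts in.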

Private Use code points may appear, for example, in data intended to be displayed using a font that has glyphs assigned to such code points. Such fonts are kludgy, but they exist, and the approach is not formally incorrect.

Unicode code points in other major classes are to be treated as characters beyond doubt. This does not mean that an application needs to accept them, just that they validly denote characters.

Jukka K. Korpela answered Oct 15 '22 11:10


Try using String.codePointAt. Here is the API:

int java.lang.String.codePointAt(int index)

public int codePointAt(int index)
Returns the character (Unicode code point) at the specified index. 
   The index refers to char values (Unicode code units) and ranges from 0 to length() - 1. 
If the char value specified at the given index is in the high-surrogate range, the 
    following index is less than the length of this String, and the char value at the 
    following index is in the low-surrogate range, then the supplementary code point 
    corresponding to this surrogate pair is returned. Otherwise, the char value at the
    given index is returned. 


Parameters:
index - the index to the char values 
Returns:
the code point value of the character at the index 
Throws: 
IndexOutOfBoundsException - if the index argument is negative or not less than the 
    length of this string.
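
A minimal sketch of how the documented behaviour might be used to find surrogates that are not part of a pair. The class CodePointAtDemo and the method countUnpairedSurrogates are names invented for this illustration. Note that codePointAt only decodes code units; it does not by itself say whether the resulting code point is assigned.

```java
final class CodePointAtDemo {
    // Counts surrogate code units that are not part of a well-formed
    // high-then-low pair. For such a unit, codePointAt returns the char
    // value itself, which lies in the surrogate range U+D800..U+DFFF.
    static int countUnpairedSurrogates(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                count++;
            }
            // Advance by 2 for a combined pair, by 1 otherwise.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String paired = "\uD801\uDC00"; // one supplementary code point
        String lone = "A\uD801B";       // high surrogate with no low surrogate
        System.out.println(Integer.toHexString(paired.codePointAt(0)));
        System.out.println(countUnpairedSurrogates(paired));
        System.out.println(countUnpairedSurrogates(lone));
    }
}
```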
urir answered Oct 15 '22 10:10