
Why are some ints from 0x0000 to 0xFFFF not defined Unicode characters?

Tags:

java

unicode

I read in the Javadoc for Character that:

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP)

But I tried the following code and found that 2492 ints in that range are not defined! Is something wrong, or am I misunderstanding something? Thanks!

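public class Main
{
    public static void main(String[] args)
    {
        int count = 0;
        // Note: the bound is i < 0xFFFF, so U+FFFF itself is not tested.
        for (int i = 0x0000; i < 0xFFFF; i++)
        {
            // isDefined(int) is false for code points with no assigned character.
            if (!Character.isDefined(i))
            {
                count++;
            }
        }
        System.out.println(count);
    }
}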

Output:

2492

Asked by Harry.Chen, Jul 06 '15 09:07



1 Answer

The documentation for isDefined() states that a character "is defined" if it has an entry or is in a range in the UnicodeData file. This identifies the set of code points that have been assigned to characters (and it might've been better named isAssigned()). As you discovered, not all of the code points in the Basic Multilingual Plane have been assigned to characters yet (this map shows where some of the empty spaces are).
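As a rough illustration of the point above, the sketch below counts the unassigned code points over an inclusive range and checks their general category. The class name BmpGaps and the helper countUndefined are made up for this example. The exact count is not fixed: it depends on the Unicode version bundled with the JDK in use, so it may differ from the 2492 reported in the question.

```java
public class BmpGaps {
    // Hypothetical helper: counts code points in [first, last] (inclusive)
    // for which Character.isDefined returns false.
    static int countUndefined(int first, int last) {
        int count = 0;
        for (int cp = first; cp <= last; cp++) {
            if (!Character.isDefined(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The number printed varies with the JDK's Unicode version.
        System.out.println("Undefined in BMP: " + countUndefined(0x0000, 0xFFFF));

        // U+FFFF is a noncharacter, so it has no assigned character.
        System.out.println("U+FFFF defined? " + Character.isDefined(0xFFFF));

        // An undefined code point reports the general category UNASSIGNED.
        System.out.println("U+FFFF unassigned? "
                + (Character.getType(0xFFFF) == Character.UNASSIGNED));
    }
}
```

Unlike the loop in the question, this version uses an inclusive upper bound, so U+FFFF is included in the count.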

However, even if a code point has not been assigned (that is, isDefined() is false), it is still a valid code point and may be assigned in a future version of Unicode. Encoding, decoding, and otherwise working with unassigned code points is perfectly valid (although it is a little unusual).
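To make the distinction between "valid" and "assigned" concrete, here is a minimal sketch that finds the first unassigned code point in the BMP and round-trips it through UTF-8. The class name UnassignedRoundTrip and both helper methods are invented for illustration. Which code point is found first depends on the JDK's Unicode version.

```java
import java.nio.charset.StandardCharsets;

public class UnassignedRoundTrip {
    // Hypothetical helper: first BMP code point that isDefined rejects, or -1 if none.
    static int firstUndefinedInBmp() {
        for (int cp = 0x0000; cp <= 0xFFFF; cp++) {
            if (!Character.isDefined(cp)) {
                return cp;
            }
        }
        return -1;
    }

    // Hypothetical helper: encodes the code point as UTF-8, decodes it again,
    // and returns the code point read back from the decoded string.
    static int roundTripUtf8(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        int cp = firstUndefinedInBmp();
        if (cp < 0) {
            System.out.println("No unassigned code point found in the BMP.");
            return;
        }
        System.out.printf("U+%04X valid=%b defined=%b roundTripOk=%b%n",
                cp,
                Character.isValidCodePoint(cp),
                Character.isDefined(cp),
                roundTripUtf8(cp) == cp);
    }
}
```

The code point is reported as valid but not defined, and it survives the encode/decode round trip unchanged.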

Answered by 一二三, Oct 31 '22 16:10