 

How do I convert a single character code to a `char` given a character set?

Tags: java, ascii

I want to convert a decimal character code to its ASCII character, but the code below returns unexpected results. Here is the code I am using.

public static void main(String[] args) {
    char ret = (char) 146;
    System.out.println(ret); // prints nothing visible
}

I expect to get the single quote character "'" as per http://www.ascii-code.com/. Has anyone come across this? Thanks.

Paresh asked Feb 05 '23 04:02



1 Answer

So, a couple of things.

First of all the page you linked to says this about the code point range in question:

The extended ASCII codes (character code 128-255)

There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.

This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define a printable character at code point 146; the ISO-8859-1 charset that Java uses maps the range 128-159 to C1 control codes (and another reference just because). So that's already asking for trouble. You can also see this if you do the conversion through String:

String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);

Outputs the same "unexpected" result. What they appear to be referring to is actually the Windows-1252 set (aka "Windows Latin-1", a name that is almost completely obsolete these days), which does define that code point as a right single quote. (For other charsets that provide this character at 146, see this list and look for encodings that provide it at 0x92.) We can verify this as such:

String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);

So the first mistake is that the page is confusing.
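To make the difference concrete, here is a minimal sketch that decodes the same byte under both charsets and prints the resulting code points next to the plain cast. The class name `CharsetCompare` and the helper `decodeSingleByte` are made up for illustration, and it assumes your JRE ships the windows-1252 charset (it is not one of the charsets Java guarantees, though common JDKs include it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCompare {
    // Hypothetical helper: decodes one byte value (0-255) in the given
    // charset and returns the first UTF-16 code unit of the result.
    static int decodeSingleByte(int code, Charset cs) {
        String s = new String(new byte[] {(byte) code}, cs);
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JRE.
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.printf("ISO-8859-1:   U+%04X%n",
                decodeSingleByte(146, StandardCharsets.ISO_8859_1)); // U+0092
        System.out.printf("windows-1252: U+%04X%n",
                decodeSingleByte(146, cp1252));                      // U+2019
        System.out.printf("(char) 146:   U+%04X%n",
                (int) (char) 146);                                   // U+0092
    }
}
```

The plain cast and the ISO-8859-1 decode land on the same value, which is why both print the same "nothing".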

But the big mistake is you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code unit, not a byte in whatever charset you have in mind. A single char corresponds to a BMP code point; the supplementary characters (> 0xFFFF) need a pair of chars (a surrogate pair), or an int if you want to hold the full code point range in one value.

Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.

So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:

String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);

You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.

However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
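One way to guard against that in the general case is a small helper that returns an int code point and refuses input that does not decode to exactly one code point. This is only a sketch: `SingleCodePoint` and `decodeToCodePoint` are hypothetical names, and it again assumes windows-1252 is available on your JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SingleCodePoint {
    // Hypothetical helper: decodes the bytes and returns the one code point
    // they represent, or throws if the result is empty or longer than that.
    static int decodeToCodePoint(byte[] bytes, Charset cs) {
        String s = new String(bytes, cs);
        if (s.codePointCount(0, s.length()) != 1) {
            throw new IllegalArgumentException(
                    "expected exactly one code point, got " + s.length() + " char(s)");
        }
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is available on this JRE.
        int quote = decodeToCodePoint(new byte[] {(byte) 146},
                Charset.forName("windows-1252"));
        System.out.printf("U+%04X%n", quote); // U+2019

        // U+1F600 encoded as UTF-8; in UTF-16 it takes two chars,
        // so charAt(0) alone would only give you a lone surrogate.
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        int cp = decodeToCodePoint(emoji, StandardCharsets.UTF_8);
        System.out.printf("U+%04X supplementary=%b%n",
                cp, Character.isSupplementaryCodePoint(cp)); // U+1F600 supplementary=true
    }
}
```

Keep in mind that new String(byte[], Charset) silently replaces malformed or unmappable input with a replacement character, so this check catches "too many characters" but not "wrong bytes".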

As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:

Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);

Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
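For anybody who wants to check, a quick experiment along these lines should settle it. Since a CharBuffer holds UTF-16 code units, I'd expect a supplementary character to come back as a surrogate pair, i.e. two chars in the buffer. The class and method names here (`DecodeSupplementary`, `decodeUtf8`) are made up for the sketch, and UTF-8 is used only because windows-1252 cannot encode anything outside the BMP.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class DecodeSupplementary {
    // Hypothetical helper: decodes UTF-8 bytes via Charset#decode.
    static CharBuffer decodeUtf8(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    }

    public static void main(String[] args) {
        // U+1F600 encoded as UTF-8 (four bytes).
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        CharBuffer cb = decodeUtf8(emoji);

        System.out.println(cb.remaining());                        // 2
        System.out.println(Character.isHighSurrogate(cb.get(0)));  // true
        System.out.println(Character.isLowSurrogate(cb.get(1)));   // true

        // CharBuffer is a CharSequence, so the pair can be combined directly.
        System.out.printf("U+%04X%n", Character.codePointAt(cb, 0)); // U+1F600
    }
}
```

So cb.get(0) has the same limitation as s.charAt(0): for supplementary characters you need to combine both chars into an int code point.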


As an aside: In your case, 146 (0x92) cast directly to char corresponds to the Unicode character U+0092, "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and falls in the C1 control range (0x80-0x9F) reserved for ANSI terminal control (although AFAIK it isn't actually used, but it's in that range regardless). I wouldn't be surprised if browsers in some locales rendered it as a right single quote for compatibility, while terminals did something weird with it.
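You can ask Java itself how it classifies that value. A small sketch, with `ControlCheck` and `isC1Control` being names I made up for the example:

```java
class ControlCheck {
    // Hypothetical helper: true for the C1 control range U+0080..U+009F.
    static boolean isC1Control(char c) {
        return c >= 0x80 && c <= 0x9F;
    }

    public static void main(String[] args) {
        char raw = (char) 146;

        System.out.println(Character.isISOControl(raw));                  // true
        System.out.println(Character.getType(raw) == Character.CONTROL);  // true
        System.out.println(isC1Control(raw));                             // true

        // The real right single quote is ordinary punctuation, not a control.
        System.out.println(Character.isISOControl('\u2019'));             // false
    }
}
```

A check like this can be handy for flagging data that was read as ISO-8859-1 when it was really windows-1252, since genuine text rarely contains C1 controls.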

Also, fyi, the official Unicode code point for the right single quote is U+2019 (RIGHT SINGLE QUOTATION MARK). It is in the BMP, so you could reliably store it in a char by using that value, e.g.:

System.out.println((char)0x2019);

You can also see this for yourself by looking at the value after the conversion from windows-1252:

String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019

Or, for completeness:

String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
Jason C answered Feb 07 '23 18:02
