UTF-8 does not print characters to the console

I have the following code:

import java.util.Arrays;

public class MainDefault {
    public static void main(String[] args) {
        // Print the characters, then their bytes in the platform default charset
        System.out.println("²³");
        System.out.println(Arrays.toString("²³".getBytes()));
    }
}

But I can't seem to print the special characters to the console.

When I compile and run it with the default settings, I get the following result:

$ javac MainDefault.java
$ java MainDefault

[Screenshot: MainDefaultPrinting]

On the other hand, when I compile it with UTF-8 source encoding and run it like this:

$ javac -encoding UTF8 MainDefault.java
$ java MainDefault

[Screenshot: MainDefaultUTF8CompilationOnly]

And when I also run it with the file.encoding flag set to UTF8, I get the following:

$ java -Dfile.encoding=UTF8 MainDefault

[Screenshot: MainDefaultUTF8CompilationAndRun]

It doesn't seem to be a problem with the console (Git Bash on Windows 10), as echo prints the characters normally:

[Screenshot: Echo]

Thanks for your help

Yassin Hajaj asked Sep 02 '20 19:09



2 Answers

Your code is not printing the right characters to the console because your Java program and the console are using different character sets, that is, different encodings.
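To illustrate the mismatch, here is a minimal sketch. The class name MojibakeDemo and the helper reinterpret are made up for this example, and ISO-8859-1 is only an assumed stand-in for whatever charset the console actually uses. It encodes "²³" as UTF-8 and then decodes the same bytes with the other charset, which is roughly what a misconfigured console does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MojibakeDemo {
    // Hypothetical helper: encode text with one charset, then decode the
    // resulting bytes with another, as a mismatched console would.
    static String reinterpret(String text, Charset written, Charset read) {
        return new String(text.getBytes(written), read);
    }

    public static void main(String[] args) {
        // "²³" written with Unicode escapes so the source file encoding does not matter
        String text = "\u00B2\u00B3";

        // Each character takes two bytes in UTF-8
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.UTF_8)));
        // prints [-62, -78, -62, -77]

        // Assumption: the reader of the bytes uses ISO-8859-1
        String garbled = reinterpret(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Print code points instead of the string, so this output itself
        // does not depend on the console encoding
        garbled.chars().forEach(c -> System.out.printf("U+%04X%n", c));
        // prints U+00C2, U+00B2, U+00C2, U+00B3 (one per line)
    }
}
```

The two characters become four, each preceded by a stray U+00C2, which is the typical symptom when UTF-8 output is read as a single-byte charset.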

If you want to obtain the same characters, you first need to determine which character sets are in place.

This process will depend on the "console" in which you are outputting your results.

If you are working with Windows and cmd, as @RickJames suggested, you can use the chcp command to determine the active code page.

Oracle documents the full list of encodings supported by Java, along with their aliases (code pages, in this case), on this page.

This Stack Overflow answer also provides some guidance about the mapping between Windows code pages and Java charsets.

As you can see in the provided links, the code page for UTF-8 is 65001.

If you are using Git Bash (MinTTY), you can follow @kriegaex's instructions to verify or configure UTF-8 as the terminal emulator encoding.

Linux and UNIX, or UNIX-derived systems like macOS, do not use code page identifiers, but locales. The locale information can vary between systems, but you can either use the locale command or inspect the LC_* environment variables to find the required information.

This is the output of the locale command on my system:

LANG="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_CTYPE="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_ALL=

Once you know which charset the console uses, run your Java program with the file.encoding system property set to the matching charset:

java -Dfile.encoding=UTF8 MainDefault
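To check what the JVM actually picked up, a small diagnostic like the following may help. This is only a sketch: the class name ShowEncodings is made up, and the sun.stdout.encoding and stdout.encoding properties are only set on some JDK versions, so either may print as null.

```java
import java.nio.charset.Charset;

public class ShowEncodings {
    // Hypothetical helper: collect the encoding-related settings the JVM reports
    static String describe() {
        return "default charset: " + Charset.defaultCharset()
                + ", file.encoding: " + System.getProperty("file.encoding")
                // Assumption: these two properties depend on the JDK version and may be null
                + ", sun.stdout.encoding: " + System.getProperty("sun.stdout.encoding")
                + ", stdout.encoding: " + System.getProperty("stdout.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once without the option and once with -Dfile.encoding=UTF8 shows whether the option changes the values reported on your JDK.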

Some classes, like PrintStream or PrintWriter, allow you to specify the Charset in which the output will be written.
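As a rough sketch of that approach, the following wraps System.out in a PrintStream that always encodes as UTF-8, regardless of file.encoding. The class name Utf8Out and the helper utf8Stream are made up for illustration, and the example assumes the console itself is configured to display UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    // Hypothetical helper: wrap any OutputStream in an auto-flushing
    // PrintStream that encodes characters as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // "²³" written with Unicode escapes so the source file encoding does not matter
        out.println("\u00B2\u00B3");
    }
}
```

This only controls the bytes the program writes; if the terminal decodes them with another charset, the output will still look wrong.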

The -encoding javac option only allows you to specify the character encoding used by source files.

If you are using Windows with Git Bash, consider also reading @rmunge's answer: it describes a possible bug in the tool that may be the cause of the problem and that prevents the terminal from working correctly out of the box, so manual encoding adjustments are needed.

jccampanero answered Oct 09 '22 17:10


I am also using Git Bash on Windows 10, and it works totally fine for me.

Here's how it prints:

[Screenshot: Trying to reproduce it in Git Bash on Windows 10]

The terminal version is mintty 3.0.2 (x86_64-pc-msys), and my text properties were:

[Screenshot]

So I tried to reproduce your output by changing the Character Set:

[Screenshot]

After setting Character Set to CP437 (OEM codepage), which also changed Locale to C automatically, I got the same output as you did.

[Screenshot]

Then, when I changed it back to UTF-8 (Unicode), I got the expected output!

[Screenshot]

Therefore, it is clear that the problem is with your console's Character Set.

Tharindu Sathischandra answered Oct 09 '22 15:10