
Characters "æ", "ø", "å" in Java Strings (Windows)

Tags:

java

utf-8

For some reason, a String that is assigned the letter å through the Scanner class does not equal a String that is assigned å the "normal" way (String a = "å"). Why is that?

import java.util.*;

public class UTF8Test {
    public static void main(String[] args) {

        String[] Norge = {"løk", "hår", "vår", "sær", "søt"};

        Scanner input = new Scanner(System.in);

        String test = input.nextLine();  // I enter løk here
        System.out.println(test);
        System.out.println(Norge[0]);

        for (int i = 0; i < Norge.length; i++) {
            if (Norge[i].equals(test)) {
                System.out.println("YES!!");
            }
        }
    }
}

The console shows this when I run it:

løk

løk

l├©k

Sing Sandibar asked Nov 13 '13 15:11




2 Answers

Provided that your sole requirement is to use UTF-8 everywhere, as the UTF8Test class name indicates, your main mistake is that you're using the Windows command console to compile and run your Java program. The ├©, a mojibaked form of ø, strongly suggests that your Java source code file was compiled using the CP850 encoding. As evidence, run this in a UTF-8 capable environment:

System.out.println(new String("ø".getBytes("UTF-8"), "CP850"));

This prints ├©. This in turn strongly suggests that you were using the Windows command console to compile your Java source code file, as that's currently the only commonly used environment that uses CP850 by default. However, the Windows command console is not UTF-8 capable.

When you save the source code file using UTF-8 encoding in your text editor (a conversion from characters to bytes), the ø character is turned into the bytes 0xC3 and 0xB8 (as evidence, see the "UTF-8 (hex)" entry in the U+00F8 character info). When you then run javac UTF8Test.java, the UTF-8 saved source code file is read (converted from bytes back to characters) using the CP850 encoding. In that encoding, the bytes 0xC3 and 0xB8 represent the characters ├ and © (as evidence, see the CP850 codepage layout). This fully explains your initial problem.
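The byte-level round trip described above can be reproduced in a few lines. This is only a sketch: the class and method names (MojibakeDemo, misdecode, hex) are made up for illustration, and it assumes the running JRE ships the IBM850 (CP850) charset. The character is written as a Unicode escape and the results are printed as code points, so neither the source file's encoding nor the console's encoding affects what you see.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encode text as UTF-8, then decode those bytes with a different charset,
    // mimicking a compiler that reads a UTF-8 file using the wrong encoding.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // o-slash, written as an escape so the source encoding does not matter
        String oSlash = "\u00F8";
        // Assumption: the IBM850 charset is available in this JRE
        Charset cp850 = Charset.forName("IBM850");

        // prints "C3 B8"
        System.out.println(hex(oSlash.getBytes(StandardCharsets.UTF_8)));

        // prints U+251C (box-drawing character) and U+00A9 (copyright sign)
        for (char c : misdecode(oSlash, cp850).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

U+251C and U+00A9 are ├ and ©, the same pair that shows up in the question's output.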

True, you can instruct javac to read the source code file as UTF-8 with the -encoding UTF-8 argument. However, the Windows command console on its own does not support UTF-8 input and output at all. When you recompile using -encoding UTF-8, you would still get mojibaked output because the command console can't properly represent UTF-8 output. I tried it here and I got a degree symbol instead:

løk
l°k

This problem is not solvable if you intend to use UTF-8 everywhere and want to stick to the Windows command console as your input/output environment. Basically, you need a UTF-8 capable input/output environment. Decent IDEs like Eclipse and NetBeans are such environments. Or, if you intend to run it as a UTF-8 capable standalone program, a Swing UI should be preferred over a GUI-less console program.
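Once the environment really does deliver UTF-8 bytes, it can help to name the charset explicitly rather than rely on the platform default. The sketch below is an illustration, not part of the original program: the class ExplicitCharsetInput and the helper readFirstLine are hypothetical names, and a byte array stands in for System.in so the effect is reproducible anywhere. It again assumes the IBM850 charset is available in the JRE.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

class ExplicitCharsetInput {

    // Read the first line from a stream, decoding its bytes with the named
    // charset instead of the platform default.
    static String readFirstLine(InputStream in, String charsetName) {
        Scanner scanner = new Scanner(in, charsetName);
        return scanner.nextLine();
    }

    public static void main(String[] args) {
        // The word from the question followed by a newline, as UTF-8 bytes:
        // 'l', o-slash (0xC3 0xB8), 'k', '\n'
        byte[] utf8Input = {0x6C, (byte) 0xC3, (byte) 0xB8, 0x6B, 0x0A};
        String expected = "l\u00F8k";

        String asUtf8 = readFirstLine(new ByteArrayInputStream(utf8Input), "UTF-8");
        String asCp850 = readFirstLine(new ByteArrayInputStream(utf8Input), "IBM850");

        System.out.println(expected.equals(asUtf8));  // prints true
        System.out.println(expected.equals(asCp850)); // prints false
    }
}
```

In a real program the same idea would be new Scanner(System.in, "UTF-8"). That only helps when the bytes arriving on System.in actually are UTF-8, which, as explained above, is not the case in the Windows command console.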

BalusC answered Oct 20 '22 15:10


If you want to have a string literal with a special character, you can try using a Unicode escape:

String [] Norge = {"l\u00F8k", "h\u00E5r", "v\u00E5r", "s\u00E6r", "s\u00F8t"};

While it is not wrong to include special characters in source code (at least in Java), it can in some cases cause problems with poorly configured editors, compilers, or terminals. Personally, I steer clear of special characters altogether if I can.

Incidentally, you can also use Unicode escapes elsewhere in Java source code, including Javadoc comments and class, method, and variable names.
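A small check of why the escape form is robust: the compiler turns the escape into the character U+00F8 no matter how the source file is encoded, so the literal matches a string decoded correctly from UTF-8 bytes. This is a sketch with made-up names (UnicodeEscapeDemo, fromUtf8), not code from the question.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEscapeDemo {

    // Decode raw UTF-8 bytes into a String, independent of any
    // source-file encoding.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Literal written with an escape, as in the array above
        String escaped = "l\u00F8k";
        // The same word as the UTF-8 bytes a correctly configured input would deliver
        String decoded = fromUtf8(new byte[] {0x6C, (byte) 0xC3, (byte) 0xB8, 0x6B});

        System.out.println(escaped.equals(decoded));   // prints true
        System.out.println(escaped.length());          // prints 3
        System.out.println((int) escaped.charAt(1));   // prints 248 (0xF8)
    }
}
```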

If you are compiling from the command line, you can tell the compiler to read the source as UTF-8 by passing the -encoding option with UTF-8 as its parameter, like so:

javac -encoding UTF-8 ...

You may also find this question useful: Special Character in Java


You might consider externalizing the strings as an alternative way to solve the problem. Eclipse provides a way to do this automatically, but it basically just takes all the literal strings, puts them in a separate file, and reads from that file to get the appropriate string. This also allows you to create a translation of the program by making a different file with translations of all the strings, or to reconfigure application messages without having to recompile.


EDIT: I just tried compiling and running it myself (in Eclipse), and I did not see the problem you mention. It is therefore likely an issue with your particular setup.

When I reconfigured it to compile the code as US-ASCII, it output l?k both times.

When I reconfigured it to compile the code as UTF-8, the output was løk and løk.

When I compiled it as UTF-16, the output was þÿ l ø k and løk. However, I could not copy the blank spaces in þÿ l ø k from the terminal: it would let me copy the first two, but leave off the rest. This is probably related to the issue you were having: there could be some control characters that are messing it up in your case.

AJMansfield answered Oct 20 '22 16:10