For some reason, a String that is assigned the letter å via the Scanner class does not equal a String that is assigned å the "normal" way, i.e. String a = "å". Why is that?
import java.util.*;

public class UTF8Test {
    public static void main(String[] args) {
        String[] Norge = {"løk", "hår", "vår", "sær", "søt"};
        Scanner input = new Scanner(System.in);
        String test = input.nextLine(); // I enter løk here
        System.out.println(test);
        System.out.println(Norge[0]);
        for (int i = 0; i < Norge.length; i++) {
            if (Norge[i].equals(test)) {
                System.out.println("YES!!");
            }
        }
    }
}
Running the program shows this:
løk
løk
l├©k
Provided that your sole requirement is being able to use UTF-8 everywhere, as the UTF8Test class name indicates, your main mistake is that you're using the Windows command console to compile and run your Java program. The ├© as a mojibaked form of ø strongly suggests that CP850 encoding was used to compile your Java source code file. As evidence, run this in a UTF-8 capable environment:

System.out.println(new String("ø".getBytes("UTF-8"), "CP850"));

This prints ├©. This in turn strongly suggests that you were using the Windows command console to compile your Java source code file, as that's currently the only commonly used environment which uses CP850 by default. However, the Windows command console is not UTF-8 capable.
When you save (convert from chars to bytes) the source code file using UTF-8 encoding in your text editor, the ø character is turned into the bytes 0xC3 and 0xB8 (as evidence, see the "UTF-8 (hex)" entry in the U+00F8 character info). When you then run javac UTF8Test.java, the UTF-8 saved source code file is read (converted from bytes to characters) using CP850 encoding. In that encoding, the bytes 0xC3 and 0xB8 represent the characters ├ and © (as evidence, see the CP850 codepage layout). This fully explains your initial problem.
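To see that byte-level round trip concretely, here is a minimal sketch of my own (the class name MojibakeDemo is just illustrative) that prints the UTF-8 bytes of ø and then mis-decodes them as CP850:

import java.util.Arrays;

public class MojibakeDemo {
    public static void main(String[] args) throws Exception {
        // \u00F8 is ø; using the escape keeps this demo independent of
        // the source file's own encoding.
        byte[] utf8 = "\u00F8".getBytes("UTF-8");
        // Prints [-61, -72], i.e. the unsigned bytes 0xC3 and 0xB8.
        System.out.println(Arrays.toString(utf8));
        // Decoding those same two bytes as CP850 yields ├ and ©.
        System.out.println(new String(utf8, "CP850"));
    }
}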
True, you can instruct javac to read the source code file as UTF-8 via the -encoding UTF-8 argument. However, the Windows command console on its own does not support UTF-8 input and output at all. When you recompile using -encoding UTF-8, you would still get mojibaked output because the command console can't properly render UTF-8. I tried it here and got a degree symbol instead:

løk
l°k

This problem is not solvable if you intend to use UTF-8 everywhere and want to stick to the Windows command console as your input/output environment. Basically, you need a UTF-8 capable input/output environment. Decent IDEs such as Eclipse and NetBeans qualify. Or, if you intend to run it as a UTF-8 capable standalone program, prefer a Swing UI over a GUI-less console program.
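As a minimal sketch of that last suggestion (my example, not from the original answer), a Swing dialog keeps everything in Unicode and bypasses the console codepage entirely, assuming the source is compiled with the correct -encoding:

import javax.swing.JOptionPane;

public class Utf8SwingTest {
    public static void main(String[] args) {
        // Swing strings are Unicode throughout; no console codepage is involved.
        String[] norge = {"l\u00F8k", "h\u00E5r", "v\u00E5r", "s\u00E6r", "s\u00F8t"};
        String test = JOptionPane.showInputDialog("Enter a word:");
        for (String word : norge) {
            if (word.equals(test)) {
                JOptionPane.showMessageDialog(null, "YES!!");
            }
        }
    }
}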
If you want to have a string literal with a special character, you can try using a Unicode escape:
String[] Norge = {"l\u00F8k", "h\u00E5r", "v\u00E5r", "s\u00E6r", "s\u00F8t"};
While it is not wrong to include special characters in source code (at least in Java), it can in some cases cause problems with poorly configured editors, compilers, or terminals; personally, I steer clear of special characters altogether when I can.

Incidentally, you can also use Unicode escapes elsewhere in Java source code, including Javadoc comments and class, method, and variable names, as the sketch below shows.
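For instance, this is a legal (if inadvisable) sketch of my own; the compiler translates Unicode escapes before tokenizing the source, so \u00F8 works even inside an identifier:

/** Prints l\u00F8k. */
public class EscapeDemo {
    public static void main(String[] args) {
        // The identifier below is actually named "løk": escapes are
        // processed before the compiler lexes the source.
        String l\u00F8k = "l\u00F8k";
        System.out.println(l\u00F8k);
    }
}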
If you are compiling from the command line, you can tell the compiler to read the source as UTF-8 by passing the -encoding option with UTF-8 as its parameter, like so:

javac -encoding UTF-8 ...
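For the file in the question, that would presumably be:

javac -encoding UTF-8 UTF8Test.java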
You may also find this question useful: Special Character in Java
You might consider externalizing the strings as an alternative way to solve the problem. Eclipse provides a way to do this automatically, but it basically just takes all the literal strings, puts them in a separate file, and reads from that file to get the appropriate string. This also lets you translate the program by making a different file with translations of all the strings, or reconfigure application messages without recompiling. A sketch of the usual mechanism follows.
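The standard Java mechanism for this is a ResourceBundle backed by a .properties file; here is a minimal sketch (the file and key names are my own, not from the answer):

import java.util.ResourceBundle;

public class ExternalizedStrings {
    public static void main(String[] args) {
        // Assumes a file named messages.properties on the classpath containing:
        //   onion=l\u00F8k
        // Unicode escapes are the safe choice here: before Java 9,
        // .properties files were read as ISO-8859-1 rather than UTF-8.
        ResourceBundle bundle = ResourceBundle.getBundle("messages");
        System.out.println(bundle.getString("onion"));
    }
}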
EDIT: I just tried compiling and running it myself (in Eclipse), and I did not have the problem you mention. It is therefore likely an issue with your particular setup.
When I reconfigured it to compile the code as US-ASCII, it output l?k both times.

When I reconfigured it to compile the code as UTF-8, the output was løk and løk.

When I compiled it as UTF-16, the output was þÿ l ø k and løk; however, I could not copy the blank spaces in þÿ l ø k from the terminal: it would let me copy the first two, but it left off the rest. This is probably related to the issue you were having; they could be control characters that are messing things up in your case.
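Those "blank spaces" are consistent with UTF-16's byte order mark and zero high bytes; a small sketch of mine that prints them:

import java.util.Arrays;

public class Utf16Bytes {
    public static void main(String[] args) throws Exception {
        byte[] utf16 = "l\u00F8k".getBytes("UTF-16");
        // Prints [-2, -1, 0, 108, 0, -8, 0, 107]: the byte order mark
        // FE FF, then 00 6C ('l'), 00 F8 ('ø'), 00 6B ('k'). The FE FF
        // pair mis-decodes as þÿ, and the 00 bytes show up as the
        // uncopyable "blanks" in the terminal.
        System.out.println(Arrays.toString(utf16));
    }
}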