
Comparing utf-8 strings in java

Tags: java, unicode

In my Java program, I am retrieving some data from XML. The XML contains a few international characters and is encoded in UTF-8. I read it with an XML parser. Once I retrieve a particular international string from the parser, I need to compare it with a set of predefined strings. The problem is that when I use String.equals on an international string, the comparison fails.

How do I compare strings containing international characters in Java? I am using SAXParser and XMLReader to read strings from the XML.

Here's the code that compares the strings:

 String country = getXMLNodeString();

 if (country.equals("Côte d'Ivoire"))
 {

 }

 String getXMLNodeString() throws Exception
 {
     /* Get a SAXParser from the SAXParserFactory. */
     SAXParserFactory spf = SAXParserFactory.newInstance();
     SAXParser sp = spf.newSAXParser();

     /* Get the XMLReader of the SAXParser we created. */
     XMLReader xr = sp.getXMLReader();
     /* Create a new ContentHandler and apply it to the XMLReader. */
     XmlParser xmlParser = new XmlParser();  // my class to parse xml
     xr.setContentHandler(xmlParser);

     /* Parse the XML data from our URL. */
     xr.parse(new InputSource(url.openStream()));
     /* Parsing has finished. */

     // return string here
 }
cppdev asked May 08 '10 02:05



3 Answers

Java stores Strings internally as an array of chars, which are 16-bit unsigned values. This was based on an earlier Unicode standard that supported only 64K characters.

Your String constant "Côte d'Ivoire" is in this format. If the character encoding of your XML document is declared correctly, the String read from it will be in the same format. So the possible errors are:

  1. The XML document doesn't declare a character encoding;

  2. The declared character encoding does not match the actual character encoding used.
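To illustrate the second error, here is a minimal sketch of what happens when UTF-8 bytes are decoded with the wrong charset. ISO-8859-1 is only an assumed example of a wrong charset, and the class and method names are hypothetical. The literal uses a Unicode escape for ô so the result does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {
    // Expected value; \u00F4 is the escape for the letter o with circumflex.
    static final String EXPECTED = "C\u00F4te d'Ivoire";

    /** Encodes EXPECTED as UTF-8, then decodes those bytes with the given charset. */
    static String decodeAs(Charset charset) {
        byte[] utf8 = EXPECTED.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        String right = decodeAs(StandardCharsets.UTF_8);
        // Assumed wrong charset: each UTF-8 byte becomes its own char.
        String wrong = decodeAs(StandardCharsets.ISO_8859_1);

        System.out.println("correct charset equal: " + right.equals(EXPECTED));
        System.out.println("wrong charset equal:   " + wrong.equals(EXPECTED));
        System.out.println("lengths: " + right.length() + " vs " + wrong.length());
    }
}
```

The mis-decoded string is one char longer because the two UTF-8 bytes of ô turn into two separate characters, so equals() can never succeed.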

Perhaps the XML string is being treated as US-ASCII instead of UTF-8. I would output both strings and eyeball them. If they look the same, compare them character by character to see where the comparison fails. You may also want to compare the UTF-8 encoding of your constant String to what's in the XML document:

byte[] bytes = "Côte d'Ivoire".getBytes("UTF-8");
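The character-by-character comparison could be sketched as below. The helper names are hypothetical, and the `fromXml` value in `main` is an assumed stand-in for whatever the parser returned: it renders the same as the constant but is built from a plain "o" followed by a combining circumflex (U+0302), one way two strings can look identical and still differ.

```java
import java.nio.charset.StandardCharsets;

final class StringDiff {
    /** Returns the index of the first differing char, or -1 if the strings are equal. */
    static int firstMismatch(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    /** Formats every char of the string as U+XXXX, separated by spaces. */
    static String dumpChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** Formats the UTF-8 encoding of the string as hex byte pairs. */
    static String dumpUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String expected = "C\u00F4te d'Ivoire";
        // Assumption for illustration only: decomposed form, o + combining circumflex.
        String fromXml = "Co\u0302te d'Ivoire";

        System.out.println("first mismatch at index " + firstMismatch(expected, fromXml));
        System.out.println(dumpChars(expected));
        System.out.println(dumpChars(fromXml));
        System.out.println(dumpUtf8(expected));
        System.out.println(dumpUtf8(fromXml));
    }
}
```

Running the same dump on the real value from the parser shows exactly which chars or bytes differ from the constant.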

It gets more complicated when you start getting into "supplementary characters". These are characters whose code points (the Unicode term for a character's number) lie beyond the originally intended 64K range, so they need two chars each. See Supplementary Characters in the Java Platform. This isn't an issue with any of the characters you're using, but it's worth noting for completeness.
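As a small illustration of supplementary characters, the sketch below builds a string from a single code point above U+FFFF (U+1F600, chosen arbitrarily) and compares its char count with its code point count. The class and method names are hypothetical.

```java
final class SupplementaryDemo {
    /** Number of 16-bit chars in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 does not fit in one 16-bit char, so Java stores it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println("chars: " + charCount(supplementary));
        System.out.println("code points: " + codePointCount(supplementary));

        // A character such as \u00F4 fits in a single char.
        String basic = "\u00F4";
        System.out.println("chars: " + charCount(basic));
        System.out.println("code points: " + codePointCount(basic));
    }
}
```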

cletus answered Oct 01 '22 03:10


Since you're comparing with a string literal, you need to make sure that you're saving your source file in the same encoding that javac is expecting. You can also specify what encoding your source files are in with the -encoding argument to javac.

That seems like the most likely "gotcha" in this scenario.

Note that I'm talking about the encoding of your Java source code, not the XML document.
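One way to take the source file encoding out of the picture, sketched here as an alternative to passing -encoding to javac, is to write the non-ASCII character as a Unicode escape. The class and method names are hypothetical.

```java
final class EscapedLiteral {
    // \u00F4 is the letter o with circumflex; the escape is pure ASCII in the source file,
    // so the compiled constant does not depend on the encoding javac assumes.
    static final String COUNTRY = "C\u00F4te d'Ivoire";

    /** Compares a candidate (for example, a value read from XML) with the constant. */
    static boolean isExpectedCountry(String candidate) {
        return COUNTRY.equals(candidate);
    }

    public static void main(String[] args) {
        System.out.println("length: " + COUNTRY.length());
        System.out.println("second char: U+"
                + Integer.toHexString(COUNTRY.charAt(1)).toUpperCase());
    }
}
```

If the comparison starts working once the literal is escaped, the source file encoding was the culprit.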

John Flatness answered Oct 01 '22 03:10


Java strings are always UTF-16. Your XML parser should be converting the file's UTF-8 characters into UTF-16 while reading, and your own strings are already UTF-16 in memory, so you can compare them with an ordinary equals() call. If they aren't comparing equal when you think they should, the problem is likely something else.
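A minimal sketch of that round trip, assuming a tiny in-memory document with a hypothetical `<country>` element in place of the real URL stream: the XML is encoded as UTF-8 bytes, parsed with SAX, and the resulting text is compared using equals(). Note that SAX may deliver an element's text in several characters() calls, so the handler appends each chunk; a handler that keeps only the last chunk would return a truncated string, which is one thing worth checking in your own handler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxUtf8Compare {
    /** Parses the given XML bytes and returns all character data concatenated. */
    static String textOf(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Text can arrive in pieces, so append instead of overwriting.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the declaration matches the bytes' real encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<country>C\u00F4te d'Ivoire</country>";
        String country = textOf(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(country.equals("C\u00F4te d'Ivoire"));
    }
}
```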

Wyzard answered Oct 01 '22 02:10