
How do I convert a string to UTF-8 in Android?

I am using an HTML parser called Jsoup to load and parse HTML files. The problem is that the webpage I'm scraping is encoded in the ISO-8859-1 charset, while Android uses UTF-8 encoding (?). This results in some characters showing up as question marks.

So now I guess I should convert the string to UTF-8 format.

I have found a class called CharsetEncoder in the Android SDK, which I guess could help me. But I can't figure out how to use it in practice, so I wonder if I could get some help in the form of a practical example.

UPDATE: Code to read data (Jsoup)

URL url = new URL("http://www.example.com");
Document doc = Jsoup.parse(url, 4000);
droidgren asked Jul 01 '10 21:07


People also ask

How do I convert String to UTF?

To convert a String to UTF-8 in Java, use the getBytes() method, which encodes the String into a sequence of bytes and returns a byte array. Its signature is getBytes(String charsetName), where charsetName names the charset used to encode the String into bytes.

What is a UTF-8 String?

UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

Is Java a UTF-8 String?

String objects in Java are encoded in UTF-16. Java Platform is required to support other character encodings or charsets such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data. There are two general types of encoding errors.
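
In `java.nio.charset` terms, the two error types are malformed input (bytes that are not valid in the charset being decoded) and unmappable characters (characters the target charset cannot represent). Below is a small sketch that triggers each one. The class and method names (`EncodingErrors`, `decodeUtf8Strict`, `fitsLatin1`) are made up for illustration. It also shows that the convenience methods `new String(...)` and `getBytes(...)` do not throw in these cases but substitute a replacement character, which is usually where stray question marks come from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingErrors {

    // Strictly decode bytes as UTF-8; returns null if the input cannot be decoded.
    static String decodeUtf8Strict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // True if every character of text can be represented in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // 0xF6 is o-umlaut in ISO-8859-1, but a lone 0xF6 is never valid UTF-8.
        byte[] latin1 = {'s', 'm', (byte) 0xF6, 'r'};
        System.out.println(decodeUtf8Strict(latin1));                    // strict: malformed input
        System.out.println(new String(latin1, StandardCharsets.UTF_8));  // lenient: U+FFFD substituted

        // The euro sign U+20AC has no ISO-8859-1 code point.
        System.out.println(fitsLatin1("\u20ac"));                        // unmappable character
        byte[] lossy = "\u20ac".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) lossy[0]);                             // lenient: '?' substituted
    }
}
```

Using `CodingErrorAction.REPORT` is a choice made here to surface the problem early; `REPLACE` and `IGNORE` are the other options.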


2 Answers

You can let Android do the work for you by reading the page into a byte[] and then using the Jsoup methods that parse String objects.

When you create the String from the bytes read from the server, don't forget to specify the encoding by using the String constructor that takes a charset.
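
A minimal sketch of this approach, assuming the page really is served as ISO-8859-1. The names `PageDecoder`, `readAll` and `decodeLatin1` are made up for illustration, and an in-memory stream stands in for the network connection; on Android the bytes would come from something like `url.openStream()`. The resulting String can then be handed to `Jsoup.parse(String html)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PageDecoder {

    // Read the whole stream into a byte[] without interpreting it as text.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Decode the raw bytes with the charset the server used
    // (assumed here to be ISO-8859-1, as stated in the question).
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response body, encoded in ISO-8859-1:
        // 0xF6 is o-umlaut and 0xE5 is a-ring.
        byte[] body = {'<', 'p', '>', 's', 'm', (byte) 0xF6, 'r', 'g',
                       (byte) 0xE5, 's', '<', '/', 'p', '>'};
        String html = decodeLatin1(readAll(new ByteArrayInputStream(body)));
        System.out.println(html);
        // html is now an ordinary Java String and can be passed to Jsoup.parse(html).
    }
}
```

Once decoded, the String needs no further conversion to UTF-8: the charset only matters at the point where bytes become characters, or characters become bytes again.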

Al Sutton answered Nov 14 '22 23:11


Byte encodings and Strings

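import java.io.UnsupportedEncodingException;

// Class name is illustrative; the original snippet showed only main().
public class StringConverter {

   public static void main(String[] args) {

      // Reported default encoding of the platform; getBytes() without arguments uses the default charset.
      System.out.println(System.getProperty("file.encoding"));
      String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";

      System.out.println("original = " + original);
      System.out.println();

      try {
          // Encode explicitly as UTF-8, and again with the platform default charset.
          byte[] utf8Bytes = original.getBytes("UTF-8");
          byte[] defaultBytes = original.getBytes();

          // Decode with the same charset that produced the bytes.
          String roundTrip = new String(utf8Bytes, "UTF-8");
          System.out.println("roundTrip = " + roundTrip);

          System.out.println();
          printBytes(utf8Bytes, "utf8Bytes");
          System.out.println();
          printBytes(defaultBytes, "defaultBytes");
      } catch (UnsupportedEncodingException e) {
          e.printStackTrace();
      }

   } // main

   // Helper not shown in the original snippet: prints each byte as two hex digits.
   private static void printBytes(byte[] array, String name) {
      for (int k = 0; k < array.length; k++) {
          System.out.println(name + "[" + k + "] = 0x" + String.format("%02x", array[k] & 0xff));
      }
   }
}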
droidgren answered Nov 14 '22 21:11