How to convert String encoded in windows-1250/Cp1250 to utf-8?

Tags: java, string

As the title says: I read the content from an HTTP response.

 

    InputStream is = response.getEntity().getContent();
    String cw = IOUtils.toString(is);
    byte[] b = cw.getBytes("Cp1250");
    String x = StringUtils.newStringUtf8(b);
    String content = new String(b, "UTF-8");

    System.out.println(content);

 

I have tried plenty of variations. I am a little confused about which encoding names are correct when passed as strings: windows-1250 or Cp1250? UTF-8, utf-8, or utf8?

asked Jul 07 '12 18:07 by falconseye


3 Answers

An encoding has one canonical (unique) name plus a number of aliases, and the names are case-insensitive. For instance, "UTF-8" is the canonical name, but some Java versions back it was "UTF8"; the name was changed to match common usage. The same goes for "Windows-1250", which you may also see in HTML pages. "Cp1250" (code page) is a Java-internal name.
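As a quick check of the naming point above, here is a minimal sketch that resolves several spellings through `java.nio.charset.Charset`. The class and method names (`CharsetNames`, `canonicalName`) are made up for illustration, and it assumes a full JDK where windows-1250 is installed (a stripped-down runtime may not ship it).

```java
import java.nio.charset.Charset;

final class CharsetNames {
    // Resolves a charset name (canonical or alias, in any letter case)
    // and returns the canonical name the JVM reports for it.
    // Throws UnsupportedCharsetException if the runtime lacks the charset.
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        String[] spellings = {"UTF-8", "utf-8", "utf8", "windows-1250", "Cp1250"};
        for (String spelling : spellings) {
            System.out.println(spelling + " -> " + canonicalName(spelling));
        }
    }
}
```

If two spellings print the same canonical name, they can be used interchangeably wherever a charset name is accepted.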

In Java, byte[] is binary data and String (internally Unicode) is text. Converting between the two requires an encoding; the parameter is often optional, in which case the platform default charset is used.

byte, InputStream, OutputStream <-> String, char, Reader, Writer

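// An InputStream is binary, so state the encoding the bytes are actually in
// (pass "windows-1250" here instead if that is what the response contains).
String cw = IOUtils.toString(is, "UTF-8");
byte[] b = cw.getBytes("Cp1250");   // text -> bytes in windows-1250
String x = new String(b, "Cp1250"); // bytes -> text again, same charset
String content = x;

System.out.println(content);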

So that a String is universal with respect to encoding, it internally uses char, that is, UTF-16. String constants are stored in the .class file as UTF-8 (more compact).

answered Oct 20 '22 00:10 by Joop Eggen


You seem to think that a String object has an encoding. That's not correct. An encoding is used as part of the translation from binary data (a byte[] or InputStream) to text data (a String or char[] etc).

It's not clear what IOUtils.toString is doing, but it's almost certainly losing data or at least handling it inappropriately. If your data is originally in Windows-1250, then you should use an InputStreamReader wrapping the InputStream, specifying the charset in the InputStreamReader constructor call.
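A minimal sketch of the InputStreamReader approach, assuming the response body really is windows-1250. `StreamDecoder` and `readAll` are hypothetical names, and a ByteArrayInputStream stands in for the HTTP response stream. The IOException is wrapped in an UncheckedIOException only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class StreamDecoder {
    // Decodes every byte of the stream into text using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // 0xE8 is the windows-1250 byte for U+010D (c with caron); 0x61 0x6A are "aj".
        byte[] raw = {(byte) 0xE8, (byte) 0x61, (byte) 0x6A};
        String text = readAll(new ByteArrayInputStream(raw), Charset.forName("windows-1250"));
        System.out.println(text.length() + " chars decoded");
    }
}
```

The point of the design is that the charset is named exactly once, at the place where bytes become characters; after that the String carries no encoding of its own.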

It's not clear where UTF-8 comes in - you might want to write out the data in UTF-8 afterwards, but the result of that would be byte[], not a string.

answered Oct 20 '22 01:10 by Jon Skeet


You're converting backwards. You need to get the input data as a byte array and then use new String(byteArray, "Cp1250") to create the String object. Then, if you want UTF-8, call getBytes("UTF-8") on that String.
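The two steps described here can be sketched as follows. `Transcoder` and `cp1250ToUtf8` are illustrative names, not part of any library, and the sketch assumes the input bytes are valid windows-1250.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    // Step 1: decode the windows-1250 bytes into a String.
    // Step 2: encode that String as UTF-8 bytes.
    static byte[] cp1250ToUtf8(byte[] cp1250Bytes) {
        String text = new String(cp1250Bytes, Charset.forName("windows-1250"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE8 is U+010D (c with caron) in windows-1250.
        byte[] utf8 = cp1250ToUtf8(new byte[] {(byte) 0xE8});
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Note that the result of the conversion is a byte[], not a String: "UTF-8" only has meaning once the text is turned back into bytes, for example when writing it to a file or a socket.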

answered Oct 20 '22 00:10 by Hot Licks