I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this:
import java.io.UnsupportedEncodingException;
public class Program {
public void run() throws UnsupportedEncodingException {
final String windows1251String = getWindows1251String();
System.out.println("String (Windows-1251): " + windows1251String);
final String readableString = convertString(windows1251String);
System.out.println("String (converted): " + readableString);
}
private String convertString(String windows1251String) throws UnsupportedEncodingException {
return new String(windows1251String.getBytes(), "UTF-8");
}
private String getWindows1251String() {
final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
return new String(bytes);
}
public static void main(final String[] args) throws UnsupportedEncodingException {
final Program program = new Program();
program.run();
}
}
The variable bytes contains the data shown in my debugger, it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copy and pasted that array here.
This doesn't work - readableString contains garbage.
How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Update 2: Another attempt:

Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.

Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.
public static String guessEncoding(byte[] bytes) {
String DEFAULT_ENCODING = "UTF-8";
org.mozilla.universalchardet.UniversalDetector detector =
new org.mozilla.universalchardet.UniversalDetector(null);
detector.handleData(bytes, 0, bytes.length);
detector.dataEnd();
String encoding = detector.getDetectedCharset();
System.out.println("Detected encoding: " + encoding);
detector.reset();
if (encoding == null) {
encoding = DEFAULT_ENCODING;
}
return encoding;
}
(In the light of updates I deleted my original answer and started again)
The text which appears
пїЅпїЅпїЅпїЅпїЅпїЅ
is an accurate decoding of these byte values
-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67
(Padded at either end with 32, which is space.)
So either
1) The text is garbage or
2) The text is supposed to look like that or
3) The encoding is not Windows-1215
This line is notably wrong
return new String(windows1251String.getBytes(), "UTF-8");
Extracting the bytes out of a string and constructing a new string from that is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 encoding internally (and you don't normally even need to know or care about that). The only times other encodings come into play are when text data is stored OUTSIDE of a string object - ie in your initial byte array. Conversion occurs when the String is constructed and then it is done. There is no conversion from one String type to another - they are all the same.
The fact that this
return new String(bytes);
does the same as this
return new String(bytes, "Windows-1251");
suggests that Windows-1251 is the platforms default encoding. (Which is further supported by your timezone being MSK)
I fixed this problem by modifying the piece of code, which read the text from the web site.
private String readContent(final String urlAsString) {
final StringBuilder content = new StringBuilder();
BufferedReader reader = null;
InputStream inputStream = null;
try {
final URL url = new URL(urlAsString);
inputStream = url.openStream();
reader =
new BufferedReader(new InputStreamReader(inputStream);
String inputLine;
while ((inputLine = reader.readLine()) != null) {
content.append(inputLine);
}
} catch (final IOException exception) {
exception.printStackTrace();
} finally {
IOUtils.closeQuietly(reader);
IOUtils.closeQuietly(inputStream);
}
return content.toString();
}
I changed the line
new BufferedReader(new InputStreamReader(inputStream);
to
new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));
and then it worked.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With