Java UTF-8 encoding not set to URLConnection

I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see, the text contains characters like this: /ælˈdʒɪəriə/.

When I try to get the source of the page, I get text with numeric entities like &#250; instead.

So far I've tried with the following code:

urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:

URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}

try {
    urlConn = url.openConnection(); 
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

urlConn.setDoInput(true);
urlConn.setUseCaches(false);

StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);

try {
    input = new DataInputStream(urlConn.getInputStream()); 
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine()))) 
    {
        strB.append(str); 
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }

asked Jan 19 '12 23:01

2 Answers

Also try setting a User-Agent header on your URLConnection:

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");

This solved my decoding problem like a charm.

answered Oct 14 '22 11:10

The HTML page is in UTF-8 and could contain Arabic characters and the like. However, characters above code point 127 are still encoded as numeric entities such as &#250;. An Accept-Charset request header will not help, and reading the response as UTF-8 is entirely right.

You have to decode the entities yourself. Something like:

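import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces decimal numeric entities such as &#250; with the characters they denote.
String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, ""); // copy the text before the entity, drop the entity itself
        sb.appendCodePoint(uc);      // append the decoded character
    }
    m.appendTail(sb);                // copy whatever follows the last entity
    return sb.toString();
}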

By the way, those entities may originate from processed HTML forms, that is, from the editing side of the web app.
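The same idea can be extended to hexadecimal entities such as &#xFA;. The following is only a sketch under assumptions: the class name NumericEntityDecoder and the method decode are hypothetical, and the choice to leave out-of-range or malformed entities untouched is mine, not something stated above. Named entities such as &amp; are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is chosen for illustration only.
public class NumericEntityDecoder {
    // Group 1 captures hexadecimal digits (&#xFA;), group 2 decimal digits (&#250;).
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    public static String decode(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = ENTITY.matcher(s);
        while (m.find()) {
            int codePoint;
            try {
                codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // more digits than fit in an int
            }
            if (Character.isValidCodePoint(codePoint)) {
                m.appendReplacement(sb, ""); // drop the entity text
                sb.appendCodePoint(codePoint); // append the decoded character
            } else {
                // Assumption: keep entities that do not name a valid code point as they are.
                m.appendReplacement(sb, Matcher.quoteReplacement(m.group()));
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The entities below spell the pronunciation quoted in the question.
        System.out.println(decode("&#230;l&#712;d&#658;&#618;&#601;ri&#601;"));
    }
}
```

What the console shows for the main method depends on the terminal's encoding, so compare the returned String in code rather than by eye.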

Update, after the code was added to the question:

I have replaced the DataInputStream with a (Buffered)Reader, which is meant for text. InputStreams read binary data (bytes); Readers read text (chars and Strings). An InputStreamReader takes an InputStream and an encoding, and is itself a Reader that decodes the bytes using that encoding.

StringBuilder strB = new StringBuilder(); // declared outside the try so the result stays in scope
try (BufferedReader input = new BufferedReader(
        new InputStreamReader(urlConn.getInputStream(), "UTF-8"))) { // decode the bytes as UTF-8
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n"); // readLine() strips the line terminator, so add one back
    }
} catch (IOException e) {
    e.printStackTrace();
}
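To illustrate why the Reader matters, here is a small self-contained sketch that feeds the same UTF-8 bytes through both approaches without touching the network. The class and method names are hypothetical, and the sample text is simply the pronunciation from the question. DataInputStream.readLine() is deprecated because it turns every byte into one char, so multi-byte UTF-8 sequences come out as several wrong characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are chosen for illustration only.
public class ReaderVsDataInputStream {
    // Reads one line the way the question's code does.
    @SuppressWarnings("deprecation")
    public static String firstLineViaDataInputStream(InputStream in) throws IOException {
        // Deprecated: each byte becomes one char, no charset decoding happens.
        return new DataInputStream(in).readLine();
    }

    // Reads one line through a Reader that decodes the bytes as UTF-8.
    public static String firstLineViaReader(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // 10 characters, 6 of which need two bytes each in UTF-8 (16 bytes in total).
        byte[] utf8 = "\u00e6l\u02c8d\u0292\u026a\u0259ri\u0259\n"
                .getBytes(StandardCharsets.UTF_8);

        String wrong = firstLineViaDataInputStream(new ByteArrayInputStream(utf8));
        String right = firstLineViaReader(new ByteArrayInputStream(utf8));

        System.out.println(wrong.length()); // 16: one char per byte
        System.out.println(right.length()); // 10: one char per decoded character
    }
}
```

Note that this only fixes the byte-to-text step. If the server sends numeric entities, they still have to be decoded separately as described above.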

answered Oct 14 '22 13:10

Joop Eggen

Tags: java, unicode, utf8-decode

Asked by Ales; answered by limlim and Joop Eggen.