URLConnection does not get the charset

I'm using URL.openConnection() to download something from a server. The server says

Content-Type: text/plain; charset=utf-8

But connection.getContentEncoding() returns null. Why?

Bart van Heukelom asked Oct 14 '10 14:10



3 Answers

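URLConnection.getContentEncoding() returns the value of the Content-Encoding header, not the charset from the Content-Type header. Content-Encoding describes how the body was encoded for transfer (for example gzip), so it is null when the server does not send that header.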

Source of URLConnection.getContentEncoding():

/**
 * Returns the value of the <code>content-encoding</code> header field.
 *
 * @return  the content encoding of the resource that the URL references,
 *          or <code>null</code> if not known.
 * @see     java.net.URLConnection#getHeaderField(java.lang.String)
 */
public String getContentEncoding() {
    return getHeaderField("content-encoding");
}

Instead, call connection.getContentType() to get the Content-Type value and extract the charset from it. Here is sample code that does this:

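String contentType = connection.getContentType(); // may be null if the header is missing
String charset = null;

if (contentType != null) {
    // Parameters follow the media type, separated by ';', e.g. "text/plain; charset=utf-8"
    for (String value : contentType.split(";")) {
        value = value.trim();

        if (value.toLowerCase().startsWith("charset=")) {
            // The charset value may be quoted, so strip any quotes
            charset = value.substring("charset=".length()).replace("\"", "").trim();
        }
    }
}

if (charset == null || charset.isEmpty()) {
    charset = "UTF-8"; // Assumption: fall back to UTF-8 when no charset is declared
}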
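
Once the charset name is known, it can be used to decode the response body. The following is only a minimal sketch: `ResponseDecoder`, `charsetOf` and `readBody` are hypothetical names (not part of the JDK), and the UTF-8 fallback is the same assumption as above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JDK.
final class ResponseDecoder {

    // Returns the charset named in a Content-Type value, or UTF-8 (an assumption)
    // when the header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for the "charset=" parameter
                if (p.regionMatches(true, 0, "charset=", 0, "charset=".length())) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: stop and use the default.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Reads the whole stream and decodes it with the charset from the Content-Type value.
    static String readBody(InputStream in, String contentType) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charsetOf(contentType));
    }
}
```

With a URLConnection this could be called as ResponseDecoder.readBody(connection.getInputStream(), connection.getContentType()).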

Buhake Sindi answered Oct 12 '22 23:10


This is documented behaviour: getContentEncoding() is specified to return the contents of the Content-Encoding HTTP header, which is not set in your example. You could call getContentType() and parse the resulting String yourself, or use a more full-featured HTTP client library such as the one from Apache.

Waldheinz answered Oct 12 '22 23:10


As an addition to the answer from @Buhake Sindi: if you are using Guava, you can replace the manual parsing with:

MediaType mediaType = MediaType.parse(httpConnection.getContentType());
Optional<Charset> typeCharset = mediaType.charset();

Juan M. Rivero answered Oct 12 '22 22:10