
Weird behavior when downloading html using HttpURLConnection

In my Wikipedia reader app for Android, I download an article's HTML using HttpURLConnection. Some users report that they cannot see articles and instead see some CSS, so it seems their carrier is somehow preprocessing the HTML before it is downloaded. Other Wikipedia readers seem to work fine.

Example url: http://en.m.wikipedia.org/wiki/Black_Moon_(album)

My method:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// AppLog is the app's own logging helper (not shown here).
public static String downloadString(String url) throws Exception
{
    StringBuilder downloadedHtml = new StringBuilder();

    HttpURLConnection urlConnection = null;
    BufferedReader rd = null;

    try
    {
        URL targetUrl = new URL(url);

        urlConnection = (HttpURLConnection) targetUrl.openConnection();

        // Follow redirects only for "/special" URLs
        urlConnection.setInstanceFollowRedirects(url.toLowerCase().contains("/special"));

        // Read the result from the server, decoding explicitly as UTF-8
        // (the encoding Wikipedia serves) instead of the platform default
        rd = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));

        String line;
        while ((line = rd.readLine()) != null)
            downloadedHtml.append(line).append('\n');
    }
    catch (Exception e)
    {
        // The error is logged, not rethrown, so the caller receives whatever was read so far
        AppLog.e("An exception occurred while downloading data.\r\n: " + e);
        e.printStackTrace();
    }
    finally
    {
        // Close the reader before tearing down the connection it reads from
        if (rd != null)
            rd.close();

        if (urlConnection != null)
        {
            AppLog.i("Disconnecting the http connection");
            urlConnection.disconnect();
        }
    }

    return downloadedHtml.toString();
}

I'm unable to reproduce this problem myself, but there must be a way to get around it. I even disabled redirects by calling setInstanceFollowRedirects(false), but it didn't help.

Am I missing something?

Example of what the users are reporting:

http://pastebin.com/1E3Hn2yX

Asked Apr 03 '14 07:04 by jjdev80


1 Answer

carrier is somehow preprocessing the html before it's downloaded

a way to get around that?

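Use HTTPS to prevent carriers from rewriting pages: the response is encrypted in transit, so an intermediary cannot modify it without breaking the connection. (I have no citation for this.)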
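As a rough illustration of the HTTPS suggestion, the sketch below upgrades an http:// URL to https:// before opening the connection. The class name HttpsDownload and the helper toHttps are hypothetical names introduced here, not part of the question's code. It assumes the server offers the same page over HTTPS and that try-with-resources and StandardCharsets are available (on Android that means API 19 or later). Treat it as a starting point under those assumptions, not a verified fix for the carrier issue.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class HttpsDownload {

    // Hypothetical helper: rewrites an "http://" prefix (any letter case) to
    // "https://" and returns every other URL unchanged.
    static String toHttps(String url) {
        String prefix = "http://";
        if (url.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return "https://" + url.substring(prefix.length());
        }
        return url;
    }

    // Downloads the page over HTTPS and returns its body as a string.
    // Throws IOException on any non-200 status, so a redirect or error page
    // is never mistaken for article HTML.
    static String downloadString(String url) throws IOException {
        // For an https:// URL, openConnection() returns an HttpsURLConnection,
        // which is a subclass of HttpURLConnection, so this cast is safe.
        HttpURLConnection connection =
                (HttpURLConnection) new URL(toHttps(url)).openConnection();
        try {
            int status = connection.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }

            StringBuilder html = new StringBuilder();
            // Assumption: the body is UTF-8 encoded.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString();
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling HttpsDownload.downloadString("http://en.m.wikipedia.org/wiki/Black_Moon_(album)") would then request the https:// form of the example URL. Checking the status code before reading also makes it easier to tell a rewritten or redirected response apart from a real article.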

Am I missing something?

Not that I can see.

Answered Nov 13 '22 16:11 by guest