Jsoup 404 error

Question

I am new to Jsoup and I can't understand why I receive a 404 error when fetching a page, even though the page is accessible from a browser and I don't use any proxies. I have tried the following code:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

and I receive the exception message:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)

Alkis Kalogeris · Accepted Answer

It seems that the site doesn't allow bots and returns a 404 response when the request does not carry a browser-like User-Agent header. The code below works because it sets the User-Agent header:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

User Agent

The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
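To see the effect in isolation, here is a minimal sketch that does not touch the real site. It assumes a hypothetical local server (built with the JDK's own com.sun.net.httpserver.HttpServer) that answers 404 unless the User-Agent starts with "Mozilla/", and it uses plain HttpURLConnection instead of Jsoup so it runs without extra dependencies. The class name UserAgentDemo, the helper names, and the filtering rule are all made up for illustration; the actual rule used by the site in the question is unknown.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class UserAgentDemo {

    // Hypothetical rule: reject every client whose User-Agent does not look like a browser.
    static HttpServer startPickyServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla/");
            // -1 as the length means "no response body"
            exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 404, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET request and returns the HTTP status code.
    // A null userAgent leaves the JDK default ("Java/<version>") in place.
    static int statusFor(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Returns { status with the default User-Agent, status with a browser User-Agent }.
    static int[] run() {
        try {
            HttpServer server = startPickyServer();
            try {
                String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
                int withDefault = statusFor(url, null);
                int withBrowser = statusFor(url,
                        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0");
                return new int[] {withDefault, withBrowser};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] codes = run();
        System.out.println("default User-Agent: " + codes[0]);
        System.out.println("browser User-Agent: " + codes[1]);
    }
}
```

Under these assumptions the same URL should be rejected for the default Java client and accepted once the header is set, which matches what you observe between Jsoup and your browser.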

Referrer (I don't think this is necessary)

HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.

For completeness, I would also advise you to set a timeout for your requests. The default was 3 seconds in older Jsoup releases (newer releases use 30 seconds); if the server takes longer than that, you will receive an exception. Below is your code with a timeout set. A timeout of zero is treated as infinite, so the request waits as long as the server needs.

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com") 
               .timeout(1000*5) //it's in milliseconds, so this means 5 seconds.              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

Udit Kapahi · Answer

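If you are getting response code 404, you can skip that URL.

Use ignoreHttpErrors(true) so that Jsoup does not throw an HttpStatusException on error responses and parses whatever body the server returned instead. Note that this only suppresses the exception; if you need to know whether the request actually succeeded, call execute() instead of get() and check statusCode() on the response.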

Document doc3 = null;
try {
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // do not throw HttpStatusException on 4xx/5xx responses
            .get();
} catch (IOException e) {
    // get() still throws IOException for network failures such as timeouts
    e.printStackTrace();
}

Tags:

java

html

http-status-code-404

connection

jsoup

mawus
