 

Jsoup 404 error

I am new to Jsoup and can't understand why I receive a 404 error when fetching a page, even though the page is accessible from a browser and I don't use any proxies. I have tried the following code:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

and I receive the exception message:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)
mawus asked Jun 29 '14 11:06


2 Answers

It seems that the site doesn't allow bots: it returns a 404 response when the request does not carry a User-Agent header it accepts. The code below works because it sets that header:

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

User Agent

The Hypertext Transfer Protocol (HTTP) identifies the client software originating the request, using a "User-Agent" header, even when the client is not operated by a user.
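To make the mechanism concrete, here is a minimal sketch that needs neither Jsoup nor the real site. It starts a hypothetical local server that answers 404 unless the User-Agent starts with "Mozilla/", then queries it with the JDK's HttpURLConnection. The filtering rule, the class name UserAgentDemo, and its methods are all assumptions made for illustration; how transfermarkt actually decides is not known.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

public class UserAgentDemo {

    // Hypothetical server: 200 for browser-like User-Agents, 404 for everything else.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean browserLike = ua != null && ua.startsWith("Mozilla/");
            exchange.sendResponseHeaders(browserLike ? 200 : 404, -1); // -1 means no body
            exchange.close();
        });
        server.start();
        return server;
    }

    // Sends a GET and returns the status code; pass null to keep the client's default User-Agent.
    static int fetchStatus(int port, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Returns {status with default User-Agent, status with browser-like User-Agent}.
    static int[] runDemo() {
        try {
            HttpServer server = startServer();
            try {
                int port = server.getAddress().getPort();
                int withDefault = fetchStatus(port, null);
                int withBrowser = fetchStatus(port,
                        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0");
                return new int[] {withDefault, withBrowser};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] statuses = runDemo();
        System.out.println("default User-Agent:      " + statuses[0]);
        System.out.println("browser-like User-Agent: " + statuses[1]);
    }
}
```

The first request goes out with the JDK's default "Java/<version>" User-Agent and is rejected. The second one passes only because of the header, which is the same effect .userAgent(...) has in the Jsoup code above.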


Referrer (I don't think this is necessary)

HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.

For completeness, I would also advise you to set a timeout for your requests. The default was 3 seconds when this was written (newer Jsoup releases default to 30 seconds); if the server takes longer than that, you will receive an exception (a SocketTimeoutException). Below is your code with the timeout set. A value of zero is treated as an infinite timeout.

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com") 
               .timeout(1000*5) //it's in milliseconds, so this means 5 seconds.              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
} 
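To see what a timeout does without depending on the real site, here is a rough sketch against a hypothetical local server that deliberately waits before answering. It uses the JDK's HttpURLConnection (setConnectTimeout / setReadTimeout) rather than Jsoup, so it is only an analogy for .timeout(...); the class name TimeoutDemo, the 300 ms delay, and the two timeout values are assumptions chosen for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;

public class TimeoutDemo {

    // Hypothetical slow server: sleeps delayMillis, then answers 200 with no body.
    static HttpServer startSlowServer(long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Returns true if a response arrived within timeoutMillis, false if the request timed out.
    static boolean respondsWithin(int port, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            conn.getResponseCode();
            return true;
        } catch (SocketTimeoutException e) {
            return false;
        } finally {
            conn.disconnect();
        }
    }

    // Returns {result with a 100 ms timeout, result with a 2000 ms timeout}
    // against a server that takes about 300 ms per request.
    static boolean[] runDemo() {
        try {
            HttpServer server = startSlowServer(300);
            try {
                int port = server.getAddress().getPort();
                boolean shortTimeout = respondsWithin(port, 100);
                boolean longTimeout = respondsWithin(port, 2000);
                return new boolean[] {shortTimeout, longTimeout};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] results = runDemo();
        System.out.println("100 ms timeout succeeded:  " + results[0]);
        System.out.println("2000 ms timeout succeeded: " + results[1]);
    }
}
```

The request with the short timeout gives up before the server answers, while the one with the generous timeout succeeds. That is the reason to pick a timeout comfortably above the slowest response you expect.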
Alkis Kalogeris answered Nov 04 '22 07:11


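If you get a 404 response code for a URL, you can simply skip that URL.

Calling ignoreHttpErrors(true) stops Jsoup from throwing an HttpStatusException on 4xx/5xx responses. Note that the Document you get back is then the parsed error page, so this suppresses the exception rather than fixing the cause of the 404. To tell the two cases apart, call execute() instead of get() and check statusCode() on the response before parsing it.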

Document doc3 = null;
try {
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // do not throw HttpStatusException on 4xx/5xx
            .get();
} catch (IOException e) {
    // get() throws the checked IOException, so it must be caught or declared
    e.printStackTrace();
}
Udit Kapahi answered Nov 04 '22 07:11