I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:
for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();
        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}
Everything works perfectly most of the time. However, there are a few things I want to be able to do.
First, a request sometimes returns a 404 or 500 status, or maybe a 301. With the code above, it just prints the error and moves on to the next URL. What I would like is to report the status for every link: if the page connects, print 200; if not, print the relevant status code.
Second, I sometimes catch this error: "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try connecting three times and, if the third attempt fails, add the URL to a "failed" array so I can retry the failed connections later.
Can someone with more knowledge than me help me out?
For me, calling execute() on its own throws an IOException for error responses rather than returning the status code.
Using JSoup 1.6.1, I had to change the code to use ignoreHttpErrors(true).
Now the call returns the response rather than throwing an exception, and you can check the error code and message.
try {
    Connection.Response response = Jsoup.connect(bad_url)
            .userAgent("Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5")
            .timeout(100000)
            .ignoreHttpErrors(true) // return error responses instead of throwing
            .execute();
    // Read the status inside the try block so a failed request cannot cause a NullPointerException.
    System.out.println("Status code = " + response.statusCode());
    System.out.println("Status msg = " + response.statusMessage());
} catch (IOException e) {
    // Still thrown for connection failures, timeouts, and malformed responses.
    System.out.println("io - " + e);
}
Output:
Status code = 404
Status msg = Not Found
For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:
Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .ignoreHttpErrors(true) // without this, jsoup throws on 4xx/5xx instead of returning the response
        .execute();
int statusCode = response.statusCode();
if (statusCode == 200) {
    // Parse the body that execute() already fetched; no second request is needed.
    Document doc = response.parse();
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}
Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. By default it also throws on HTTP error statuses such as 404 or 500 unless ignoreHttpErrors(true) is set on the connection. With that set, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if Jsoup is following redirects (the default), you won't see 30x response codes, because Jsoup sets the status code from the final page fetched.
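To report a status for every link, one option is to collect the results into a map. The following is only a sketch under stated assumptions: StatusReport, StatusFetcher, and collect are hypothetical names, the -1 marker for "no HTTP response" is my own convention, and the fetcher is a stand-in so the example runs without a network. With jsoup on the classpath, the fetcher could wrap the execute() call as shown in the comment.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StatusReport {

    /** Hypothetical hook: returns the HTTP status for a URL, or throws if no response arrived. */
    interface StatusFetcher {
        int statusOf(String url) throws IOException;
    }

    /** Maps each URL to its status code, or to -1 when no HTTP response was received. */
    static Map<String, Integer> collect(List<String> urls, StatusFetcher fetcher) {
        Map<String, Integer> report = new LinkedHashMap<>(); // keeps input order
        for (String url : urls) {
            try {
                report.put(url, fetcher.statusOf(url));
            } catch (IOException e) {
                // Connection failure, timeout, or malformed response: no status to report.
                report.put(url, -1);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Stand-in fetcher so the sketch runs offline. With jsoup it could be:
        // url -> Jsoup.connect(url).ignoreHttpErrors(true).followRedirects(false).execute().statusCode()
        StatusFetcher fake = url -> {
            if (url.endsWith("/missing")) return 404;
            if (url.endsWith("/down")) throw new IOException("connection refused");
            return 200;
        };
        List<String> urls = Arrays.asList(
                "http://example.com/ok", "http://example.com/missing", "http://example.com/down");
        collect(urls, fake).forEach((url, status) -> System.out.println(status + " " + url));
    }
}
```

Turning off followRedirects in the real fetcher is what would let a 301 show up in the report instead of the status of the final page.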
As for your second question, all you need is a loop around the code sample I just gave you, wrapped in a try/catch block that catches SocketTimeoutException. When you catch the exception, the loop should continue. If you're able to get data, then return or break. Shout if you need more help with it!
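The retry loop described above might look roughly like this. It is a minimal sketch, not a definitive implementation: RetrySketch, Fetcher, and fetchWithRetry are hypothetical names, the fetch itself is passed in so the example runs offline, and treating other IOExceptions as immediate failures is an assumption you may want to change.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

class RetrySketch {

    /** Hypothetical hook for the actual fetch, e.g. a wrapper around a Jsoup.connect(url) call. */
    interface Fetcher<T> {
        T fetch(String url) throws IOException;
    }

    /**
     * Tries the fetch up to maxAttempts times, retrying only on read timeouts.
     * Returns the result, or null after recording the URL in the failed list.
     */
    static <T> T fetchWithRetry(String url, int maxAttempts, Fetcher<T> fetcher, List<String> failed) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch(url); // success: stop retrying
            } catch (SocketTimeoutException e) {
                // Timed out: let the loop try again.
            } catch (IOException e) {
                break; // other I/O errors are not retried in this sketch
            }
        }
        failed.add(url); // remembered so it can be retried later
        return null;
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        int[] calls = {0};
        // Stand-in fetcher that times out twice, then succeeds.
        String body = fetchWithRetry("http://example.com/sitemap.xml", 3, url -> {
            if (++calls[0] < 3) throw new SocketTimeoutException("Read timed out");
            return "<loc>http://example.com/</loc>";
        }, failed);
        System.out.println(body + " after " + calls[0] + " attempts; failed = " + failed);
    }
}
```

In the sitemap loop, the lambda would be replaced by the Jsoup call, and the failed list would be shared across all locales so the leftover URLs can be retried in a later pass.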