I am trying to parse the HTML of an arbitrary page. I used HTML Parser and also tried Jsoup for parsing.
Jsoup has the functions I need, but I get a 403 error when calling Document doc = Jsoup.connect(url).get();
I also tried Commons HttpClient to fetch the HTML, and that succeeded for the same URL.
Why does Jsoup get a 403 for a URL that returns content through Commons HttpClient? Am I doing something wrong? Any thoughts?
The working solution is as follows (thanks to Angelo Neuschitzer for the reminder to post it as a solution):
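import javax.swing.text.html.HTML;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Setting a User-Agent is what avoids the 403 for this URL.
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements links = doc.getElementsByTag(HTML.Tag.CITE.toString());
for (Element link : links) {
    String linkText = link.text();
    System.out.println(linkText);
}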
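So, userAgent does the trick :) Most likely the server rejects requests that arrive without a browser-like User-Agent header, which is how the plain Jsoup call went out, while Commons HttpClient identifies itself differently and was let through.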
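To make the difference between the two clients visible, here is a minimal sketch that uses only the JDK's java.net.http classes (Java 11+), not Jsoup or Commons HttpClient. It builds a GET request with an explicit User-Agent header and prints the headers without sending anything. The class name UserAgentRequest, the helper build, and the example.com URL are all made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class UserAgentRequest {

    // Hypothetical helper: builds (but does not send) a GET request
    // that carries an explicit User-Agent header.
    static HttpRequest build(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL; substitute the page that returned the 403.
        HttpRequest request = build("https://example.com/", "Mozilla");
        // Print the headers that were set explicitly on the request.
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
    }
}
```

If a server still answers 403 with a short value like "Mozilla", a fuller browser-style User-Agent string may be worth trying, though whether that helps depends entirely on the server's rules.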