Recently I have been developing web scrapers in Python with BeautifulSoup. Now I want to know which libraries are preferred in Java. I did some searching and mostly see JTidy and JSoup mentioned. What is the difference between them?
JTidy is more commonly used to tidy HTML, that is, to fix malformed or faulty HTML such as unclosed tags, e.g., turning <div><span>text</div> into <div><span>text</span></div>.
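As a rough illustration of that tidying step, here is a minimal sketch. It assumes the JTidy jar (package org.w3c.tidy) is on the classpath; the class name JTidySketch and the helper method tidy are hypothetical names made up for this example. JTidy emits a complete document around the fragment, so the exact output is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

public class JTidySketch {
    // Hypothetical helper: runs malformed HTML through JTidy and returns the cleaned markup.
    static String tidy(String html) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit well-formed XHTML
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // do not report each repaired problem

        ByteArrayInputStream in =
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);          // repaired markup is written to 'out'
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The unclosed <span> from the example above.
        System.out.println(tidy("<div><span>text</div>"));
    }
}
```

The result is a string of cleaned-up markup; JTidy is not being used here to query or extract anything from it.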
JSoup, on the other hand, provides a full-blown API to parse HTML and to extract parts of it. It lets you find elements with jQuery-like selectors or with DOM methods equivalent to the ones you use in JavaScript, such as getElementById. I'd say JSoup is indeed the Java equivalent of BeautifulSoup.
For example, to extract the first paragraph of a Wikipedia article with JSoup, you could use the following:
String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
String firstParagraph = paragraphs.first().text();
Or to extract the title of this very question:
Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java
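The two lookup styles mentioned earlier, jQuery-like selectors and DOM methods such as getElementById, can also be tried without a network connection. A minimal sketch, assuming the jsoup library is on the classpath; the class name, method names, and sample HTML are hypothetical and only serve the illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorSketch {
    // Hypothetical sample markup, parsed from a string instead of a URL.
    static final String HTML =
            "<div id=\"content\"><p class=\"intro\">First</p><p>Second</p></div>";

    // Selector style: a CSS/jQuery-like query.
    static String firstIntroText() {
        Document doc = Jsoup.parse(HTML);
        Elements intro = doc.select("div#content p.intro");
        Element first = intro.first();   // null when nothing matches
        return first == null ? "" : first.text();
    }

    // DOM style: look up the container by id, then collect its <p> elements.
    static int paragraphCountById() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");   // null when the id is absent
        return content == null ? 0 : content.getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        System.out.println(firstIntroText());
        System.out.println(paragraphCountById());
    }
}
```

The null checks matter in real scrapers: first() and getElementById both return null when nothing matches, which happens easily once a site changes its markup.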
Quite a nice API, eh? :-)