
JTidy or Jsoup for Java

Recently I have been developing web scrapers in Python with BeautifulSoup. Now I want to know which libraries are preferred for this in Java. After some searching, the two I see most often are JTidy and JSoup. What is the difference between them?

torayeff asked Sep 15 '12 16:09



1 Answer

JTidy is mostly used to tidy HTML, that is, to repair malformed or faulty markup such as unclosed tags, e.g., turning <div><span>text</div> into <div><span>text</span></div>.
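To make the tidying step concrete, here is a minimal sketch that runs the malformed fragment above through JTidy. It assumes the JTidy jar is on the classpath; the class name JTidySketch and the helper tidy are made up for illustration, and the configuration shown is just one plausible setup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

// Hypothetical helper class, not part of JTidy itself.
class JTidySketch {
    // Runs the given markup through JTidy and returns the repaired document as a string.
    static String tidy(String html) {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);          // suppress the summary output
        tidy.setShowWarnings(false);  // do not print warnings such as the missing </span>
        tidy.setXHTML(true);          // ask for well-formed XHTML output

        ByteArrayInputStream in =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);          // parse the input and write the cleaned markup to out
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The unclosed <span> from the example above.
        System.out.println(tidy("<div><span>text</div>"));
    }
}
```

JTidy typically writes out a complete document (doctype, head, body) rather than only the fragment, so the repaired <span>text</span> will appear inside a generated body element.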

JSoup, on the other hand, provides a full-blown API to parse HTML and to extract parts of it. It lets you find elements with jQuery-like selectors or with DOM methods equivalent to the ones you use in JavaScript, such as getElementById. I'd say JSoup is indeed Java's equivalent of BeautifulSoup.
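The two lookup styles can be tried offline on an in-memory string before pointing JSoup at a real page. The following is only a sketch: it assumes the jsoup jar is on the classpath, and the class name JsoupSelectorSketch, its helper methods, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup itself.
class JsoupSelectorSketch {
    // Made-up markup used only for this illustration.
    static final String HTML =
        "<div id=\"main\"><p id=\"greeting\" class=\"intro\">Hello</p><p>World</p></div>";

    // DOM-style lookup, similar to document.getElementById in JavaScript.
    static String textById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    // jQuery-like selector lookup; returns the text of the first match, or null.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select(cssQuery).first();
        return first == null ? null : first.text();
    }

    // Number of elements matching the selector.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(textById(HTML, "greeting"));   // lookup by id
        System.out.println(firstText(HTML, "p.intro"));   // lookup by CSS selector
        System.out.println(count(HTML, "div#main p"));    // how many paragraphs matched
    }
}
```

Both helpers return null when nothing matches, which avoids the NullPointerException you would get from calling text() directly on an empty result's first().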

For example, to extract the first paragraph of a Wikipedia article with JSoup, you could use the following:

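import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String url = "http://en.wikipedia.org/wiki/Potato";
Document doc = Jsoup.connect(url).get();                  // fetch and parse the page (throws IOException)
Elements paragraphs = doc.select(".mw-content-ltr p");    // all paragraphs inside the article content
String firstParagraph = paragraphs.first().text();        // first() is null if nothing matched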

Or, to extract the title of this very question:

Document doc = Jsoup.connect("http://stackoverflow.com/questions/12439078/jtidy-or-jsoup-for-java").get();
String question = doc.select("#question-header a").text(); // JTidy or Jsoup for Java

Quite a nice API, eh? :-)

João Silva answered Sep 17 '22 22:09
