
TagSoup vs. Jsoup vs. HTML Parser vs. HotSax vs [closed]

The abundance of HTML parsers to choose from (and stick with) is mind-boggling:

http://java-source.net/open-source/html-parsers

How do I choose one that best suits the following requirements:

  1. Mature (fewer bugs than the rest)
  2. Alive and well (i.e. actively maintained)
  3. Fast and resource-efficient (intended to run on Android)

Based on your experience, which HTML parser would you recommend (for meeting the above requirements) and why?

Regex Rookie asked Mar 03 '11 16:03



1 Answer

Well, I found the answer, which was given by @BalusC on a different thread:

  1. If you just want to use an XML-based tool to traverse it: JTidy.
  2. If you'd like to unit test the HTML: HtmlUnit.
  3. If you'd like to extract specific data from the HTML: Jsoup.
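
To give a feel for the third case, here is a minimal sketch of pulling specific data (link text and targets) out of sloppy HTML with Jsoup. It assumes the `org.jsoup:jsoup` library is on the classpath; the class name `LinkExtractor`, the method `extractLinks`, and the sample markup are made up for illustration and are not from the original answer.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
public class LinkExtractor {

    // Parses an HTML string and returns "text -> href" for every anchor
    // that actually carries an href attribute.
    static List<String> extractLinks(String html) {
        // Jsoup.parse tolerates malformed markup (unclosed tags and so on).
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // CSS-style selector: <a> elements that have an href attribute.
        for (Element a : doc.select("a[href]")) {
            links.add(a.text() + " -> " + a.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption): note the unclosed <p> and missing </html>.
        String html = "<html><body><p>Unclosed paragraph "
                + "<a href='http://example.com/one'>One</a>"
                + "<a href=\"/two\">Two</a>"
                + "<a>No target</a></body>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Whether this is fast and lean enough on Android depends on the documents involved, so it is worth measuring against real pages before committing to it.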

Thank you @BalusC.

Regex Rookie answered Oct 12 '22 04:10