Extract and Clean HTML Fragment using HTML Parser (org.htmlparser)

Question

I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.

The operations required are:

Remove all tags that have a class of "hidden"
Remove all script tags
Remove all style tags
Remove all event attributes (on*="*")
Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements; however, I don't feel that I have an elegant solution. Currently, I parse the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to carry out the cleaning operations.

Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.

Thanks in advance!

maerics · Accepted Answer

Check out jsoup - it should handle all of your necessary tasks in an elegant way.

[Edit]

Here's a full working example per your required operations:

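import java.io.File;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Load and parse the document fragment (throws IOException if the file cannot be read).
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(s)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
Elements all = doc.select("*");
for (Element el : all) {
  // Iterate over a copy of the attribute list so removal inside the loop is safe.
  for (Attribute attr : el.attributes().asList()) {
    String attrKey = attr.getKey();
    String lowerKey = attrKey.toLowerCase(Locale.ROOT); // compare names case-insensitively
    if (lowerKey.equals("style") || lowerKey.startsWith("on")) {
      el.removeAttr(attrKey);
    }
  }
}
// See also - doc.select("*").removeAttr("style");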

Attribute names should be compared case-insensitively (so that onClick is caught as well as onclick), but otherwise this should cover most of what you need.
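
The case-sensitivity caveat can be isolated in a small predicate that uses only the standard library. The following is a minimal sketch, not part of jsoup: the class name AttributeFilter and the method shouldStrip are hypothetical names chosen for illustration.

```java
import java.util.Locale;

public class AttributeFilter {
    /**
     * Returns true for "style" and for any attribute whose name
     * starts with "on", ignoring case.
     */
    public static boolean shouldStrip(String attrName) {
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        String key = attrName.toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }

    public static void main(String[] args) {
        // Print the decision for a few sample attribute names.
        for (String name : new String[] {"onClick", "STYLE", "href", "ONLOAD"}) {
            System.out.println(name + " -> " + shouldStrip(name));
        }
    }
}
```

Inside the attribute loop, the condition would then read AttributeFilter.shouldStrip(attr.getKey()). Note that a plain prefix test also matches any non-event attribute whose name happens to begin with "on"; an explicit list of event-handler names is stricter if that matters for your pages.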

Tags: java, html-parsing, software-design

Kieran Hall
