I'm looking for an efficient approach to extracting a fragment of HTML from a web page and performing some specific operations on that HTML fragment.
The operations required are:
I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements. However, I don't feel that I have an elegant solution. Currently, I am parsing the web page with a CssSelectorNodeFilter (to get the fragment) and then re-parsing that fragment with a NodeVisitor in order to carry out the cleaning operations.
Could anybody suggest how they would tackle this problem? I would prefer to only parse the document once and perform all operations during that one parse.
Thanks in advance!
Check out jsoup - it should handle all of your necessary tasks in an elegant way.
[Edit]
Here's a full working example per your required operations:
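import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Load and parse the document (for a fragment held in a String, see Jsoup#parseBodyFragment).
File f = new File("myfile.html");
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
for (Element el : doc.select("*")) {
    // Collect the keys first so attributes are not removed while iterating over them.
    List<String> toRemove = new ArrayList<>();
    for (Attribute attr : el.attributes()) {
        String attrKey = attr.getKey();
        if (attrKey.equals("style") || attrKey.startsWith("on")) {
            toRemove.add(attrKey);
        }
    }
    for (String key : toRemove) {
        el.removeAttr(key);
    }
}
// See also - doc.select("*").removeAttr("style");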
You'll want to make sure that differences in case (for example, onClick versus onclick) don't cause an attribute name to be missed, but this should be the majority of what you need.
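One way to handle the case-sensitivity concern is to normalize the attribute name before comparing it. The following is only a minimal sketch using the Java standard library; the class name `AttrFilter` and the method `shouldStrip` are hypothetical names chosen for illustration and are not part of jsoup.

```java
import java.util.Locale;

// Hypothetical helper (not part of jsoup): decides whether an attribute
// should be stripped, ignoring the case of its name.
class AttrFilter {

    // Returns true for "style" and for any name starting with "on"
    // (the same rule as the loop in the answer, applied case-insensitively).
    static boolean shouldStrip(String attrName) {
        // Locale.ROOT avoids locale-specific surprises when lowercasing.
        String key = attrName.toLowerCase(Locale.ROOT);
        return key.equals("style") || key.startsWith("on");
    }

    public static void main(String[] args) {
        String[] samples = {"style", "STYLE", "onClick", "href", "class"};
        for (String name : samples) {
            System.out.println(name + " -> " + shouldStrip(name));
        }
    }
}
```

With a helper like this, the condition in the loop could be written as `if (AttrFilter.shouldStrip(attr.getKey()))`. Note that the prefix test matches any attribute whose name begins with "on", so it is deliberately broad.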