
Getting cleaned HTML in text from HtmlCleaner

I want to see the cleaned HTML that HtmlCleaner produces. I see there is a method called serialize on TagNode, but I don't know how to use it. Does anybody have sample code for it?

Thanks, Nayn

Nayn asked Aug 25 '11 19:08


2 Answers

Here's the sample code:

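HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean(url);

// getInnerHtml is an instance method and returns only the markup inside the root node,
// so wrap the result in the root tag again to get the whole document.
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";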
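
If you build the string this way in more than one place, the concatenation can be pulled into a small helper. This is only a minimal sketch: RootTagWrapper and wrap are hypothetical names, not part of HtmlCleaner, and the helper takes plain strings so it compiles without the library. Note that it rebuilds the root tag from its name alone, so any attributes on the root element are not carried over.

```java
// Hypothetical helper for illustration; not part of the HtmlCleaner API.
final class RootTagWrapper {
    private RootTagWrapper() {}

    /** Wraps innerHtml in an opening and closing tag named tagName (no attributes). */
    static String wrap(String tagName, String innerHtml) {
        return "<" + tagName + ">" + innerHtml + "</" + tagName + ">";
    }
}
```

With the variables from the snippet above, the call would be RootTagWrapper.wrap(root.getName(), htmlCleaner.getInnerHtml(root)).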
Rahul Sainani answered Oct 01 '22 13:10


Use a subclass of org.htmlcleaner.XmlSerializer, for example:

// get the element you want to serialize
HtmlCleaner cleaner     = new HtmlCleaner();
TagNode     rootTagNode = cleaner.clean(url);

// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);

// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String        html          = xmlSerializer.getAsString(rootTagNode);
luiss answered Oct 01 '22 11:10