
How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?

Tags:

htmlunit

I just want the text content of a page, and I want the fetching to be as lightweight as possible. Can I turn off all the parsing and the additional loading of JavaScript, CSS and other external content that HtmlUnit does out of the box?

Thomas asked Apr 10 '12 15:04


1 Answer

I think the closest thing to what you're looking for is:

WebClient webClient = new WebClient();
webClient.setCssEnabled(false);
webClient.setAppletEnabled(false);
webClient.setJavaScriptEnabled(false);

For HtmlUnit 2.13 and above, call these setters on webClient.getOptions() instead, e.g. webClient.getOptions().setJavaScriptEnabled(false).
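As a rough sketch of the getOptions() form, assuming a 2.x release where the classes live under com.gargoylesoftware.htmlunit (later releases moved them to org.htmlunit), something like this should work. The class name LightweightClient and the method name create are made up for the example.

```java
import com.gargoylesoftware.htmlunit.WebClient;

// Hypothetical helper, not part of HtmlUnit: builds a client with the
// optional processing switched off through the options object.
class LightweightClient {

    static WebClient create() {
        WebClient webClient = new WebClient();
        // Do not download or apply stylesheets.
        webClient.getOptions().setCssEnabled(false);
        // Do not load or execute scripts.
        webClient.getOptions().setJavaScriptEnabled(false);
        return webClient;
    }
}
```

The returned client is used the same way as before, e.g. LightweightClient.create().getPage(url). The HTML itself is still parsed into a DOM; these options only stop the CSS and JavaScript work on top of that.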

This question and answer might be useful too. It really made things faster for me, but I had to recompile HtmlUnit...

Finally, to get the original content of the page as the server sent it (instead of the output of asXml(), which is serialized from the parsed DOM), try the following:

WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://www.yourpage.com");
String originalHtml = page.getWebResponse().getContentAsString();
Mosty Mostacho answered Oct 03 '22 09:10