I'm looking to gather information from a set of web pages that are all formatted very similarly. Some of the information I need is loaded onto the page by JavaScript after it opens. HtmlUnit seems to be a common tool for this, so that's what I'm using. Unfortunately it's very slow, which is a complaint I've seen across a lot of forums. The webClient.getPage() call is what takes forever. When I turn off JavaScript it runs quickly, but I need some of the JavaScript to execute. Is there a way to selectively execute a few JavaScript commands instead of all of them?
Alternatively, is there a program that is much faster than HtmlUnit at processing JavaScript?
Sort of. You can programmatically decide which external JavaScript URLs to load.
If JavaScript is enabled, HtmlUnit runs all of the JavaScript embedded in the page. However, if certain external scripts are not required, you can choose not to load them.
Here's some code to get you started:
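import java.io.IOException;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;

webClient.setWebConnection(new FalsifyingWebConnection(webClient) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        // Placeholder: replace the string with the lowercase path of a script you don't need.
        if (request.getUrl().getPath().toLowerCase().equals("some url i don't need ")) {
            // Return an empty script instead of fetching the real one.
            return createWebResponse(request, "", "application/javascript");
        }
        // Everything else is fetched normally.
        return super.getResponse(request);
    }
});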
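If there are several scripts to skip, the URL check above could be pulled out into a small helper. The following is only a sketch under assumptions: the class name ScriptBlocklist and the example hosts and paths are hypothetical and not part of HtmlUnit. It uses nothing but the JDK, so the matching rule can be tried out without loading any page.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of HtmlUnit): decides which script URLs to skip. */
final class ScriptBlocklist {
    // Both sets are expected to contain lowercase entries.
    private final Set<String> blockedHosts;
    private final Set<String> blockedPathSuffixes;

    ScriptBlocklist(Set<String> blockedHosts, Set<String> blockedPathSuffixes) {
        this.blockedHosts = blockedHosts;
        this.blockedPathSuffixes = blockedPathSuffixes;
    }

    /** Returns true if the host is blocked or the path ends with a blocked suffix. */
    boolean shouldBlock(URL url) {
        String host = url.getHost().toLowerCase(Locale.ROOT);
        if (blockedHosts.contains(host)) {
            return true;
        }
        String path = url.getPath().toLowerCase(Locale.ROOT);
        for (String suffix : blockedPathSuffixes) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Convenience overload for plain strings; unparseable input is never blocked. */
    boolean shouldBlock(String rawUrl) {
        try {
            return shouldBlock(URI.create(rawUrl).toURL());
        } catch (IllegalArgumentException | MalformedURLException e) {
            return false;
        }
    }
}
```

Inside getResponse, the if condition would then become something like blocklist.shouldBlock(request.getUrl()), where blocklist is an instance created once with the hosts and path suffixes you have decided you don't need.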
The following settings might speed things up too:
// Silence HtmlUnit's logging and its CSS / incorrectness warnings.
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(java.util.logging.Level.OFF);
webClient.setCssErrorHandler(new SilentCssErrorHandler());
webClient.setIncorrectnessListener(new IncorrectnessListener() {
    @Override
    public void notify(String s, Object o) { }
});
// Only disable cookies if the pages work without them.
webClient.getCookieManager().setCookiesEnabled(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setPrintContentOnFailingStatusCode(false);