
htmlunit Cannot read property "push" from undefined

I'm trying to crawl a website using HtmlUnit. Whenever I run it, though, the only output is the following error:

Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "push" from undefined (https://www.kinoheld.de/dist/prod/0.4.7/widget.js#1)

Now I don't know much about JS, but I read that push is some kind of array operation. This seems standard to me and I don't know why it would not be supported by htmlunit.

Here is the code I'm using so far:

public static void main(String[] args) throws IOException {
    WebClient web = new WebClient(BrowserVersion.FIREFOX_45);
    web.getOptions().setUseInsecureSSL(true);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";
    web.getOptions().setThrowExceptionOnFailingStatusCode(false);
    web.waitForBackgroundJavaScript(9000);
    HtmlPage response = web.getPage(url);

    System.out.println(response.getTitleText());
}

What am I missing? Is there a way around this or a way to fix this? Thanks in advance!

Maverick283 asked Nov 17 '16 14:11


2 Answers

Try adding

web.getOptions().setThrowExceptionOnScriptError(false);

before you request the page. This tells HtmlUnit to carry on when a script throws instead of aborting the page load with an exception. It will not help in every case, though: if the JavaScript that throws the error is the code that produces the data you are scraping (which it hopefully isn't), that data will still be missing. If that happens, try using Selenium with ChromeDriver or GhostDriver.


GenuinePlaceholder answered Nov 06 '22 23:11


I've encountered a similar problem before. It comes from HtmlUnit being designed as a test harness rather than as a web scraping framework. Are you running the latest version of HtmlUnit?

I was able to run your code by adding the setThrowExceptionOnScriptError(false) line (as mentioned in the other answer) and also adding java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); at the top of the method to disable the log dump. This yielded the following output:

Royal Filmpalast München München | kinoheld.de

Full code is as follows:

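import java.io.IOException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws IOException {

    // Silence HtmlUnit's log dump (script errors are still logged otherwise).
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_45);
    String url = "https://www.kinoheld.de/kino-muenchen/royal-filmpalast/vorstellung/280823/?mode=widget&showID=280828#panel-seats";

    webClient.getOptions().setUseInsecureSSL(true);
    // Do not abort the page load when a script on the page throws.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    HtmlPage response = webClient.getPage(url);
    // Background scripts only start once a page is loaded, so wait after getPage, not before.
    webClient.waitForBackgroundJavaScript(9000);

    System.out.println(response.getTitleText());
}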

This was run from the command line on Red Hat with HtmlUnit 2.2.1. Hope this helps.
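One caveat about the logging line: java.util.logging may keep only a weak reference to a logger, so a level set on a logger that nothing else references can be lost if that logger is garbage collected. Below is a minimal sketch of the same idea with a strong reference held in a static field. It uses only the JDK, so it runs without HtmlUnit on the classpath. The class name QuietHtmlUnitLogs and the child logger name in main are made up for illustration. It also assumes HtmlUnit's output actually reaches java.util.logging, which is typically the case when Commons Logging finds no other backend such as Log4j.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietHtmlUnitLogs {

    // Strong reference: the LogManager may hold loggers only weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger HTMLUNIT_ROOT = Logger.getLogger("com.gargoylesoftware");

    /** Turns off java.util.logging output for everything under com.gargoylesoftware. */
    static void silence() {
        HTMLUNIT_ROOT.setLevel(Level.OFF);
    }

    /** Reports whether the named logger would still emit a record at the given level. */
    static boolean wouldLog(String loggerName, Level level) {
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        // Illustrative child logger name; any name below com.gargoylesoftware inherits the level.
        String child = "com.gargoylesoftware.htmlunit.SomeComponent";
        System.out.println("before: " + wouldLog(child, Level.SEVERE));
        silence();
        System.out.println("after:  " + wouldLog(child, Level.SEVERE));
    }
}
```

With the default logging configuration this should print true before the call and false after it. Calling QuietHtmlUnitLogs.silence() at the top of main, before the WebClient is created, would take the place of the one-line version above.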

Jack answered Nov 06 '22 23:11