I'm very new to HtmlUnit and I'm trying to scrape a website that uses Javascript to edit the code. I heard HtmlUnit was the best way to go as it returns the final code using a headless browser.
However as you will see I cannot even get past creating a HtmlPage object without getting a huge and impossible to understand exception thrown (at least given my virtually null experience with HtmlUnit).
Here is my code:
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Main {
public static void main(String[] args) {
Main scraper = new Main();
scraper.testingGargoyle();
}
private void testingGargoyle() {
String myUrl = "https://www.wearvr.com/#game_id=game_4";
WebClient webClient = new WebClient();
try {
HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));
} catch (FailingHttpStatusCodeException | IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
And this is the exception that gets thrown :
Apr 30, 2015 5:43:50 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Apr 30, 2015 5:43:50 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[https://load.sumome.com/] line=[1] lineSource=[null] lineOffset=[0]
Exception in thread "main" ======= EXCEPTION START ========
EcmaError: lineNumber=[19] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js] message=[TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:847)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:620)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:733)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1096)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:793)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:751)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1017)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:248)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:194)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:471)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:345)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:410)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:395)
at Main.testingGargoyle(Main.java:19)
at Main.main(Main.java:10)
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3629)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3613)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3634)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3650)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFunctionError(ScriptRuntime.java:3714)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2233)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2215)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1333)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3057)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:724)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:832)
... 31 more
Enclosed exception:
net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3629)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3613)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3634)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3650)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFunctionError(ScriptRuntime.java:3714)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2233)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2215)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1333)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:19)
at script.r(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
at script.r(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:384)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:7)
at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:463)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:463)
at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3057)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:724)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:832)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:620)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:733)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1096)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:793)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:751)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1017)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:248)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:194)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:471)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:345)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:410)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:395)
at Main.testingGargoyle(Main.java:19)
at Main.main(Main.java:10)
======= EXCEPTION END ========
I told you it was huge. How can I get around this and obtain the final source of this page in order to get scraping?
Thanks in advance!
Exceptions are thrown for several reasons, wrong html, errors on script's page, resources not found such css, script files or images files (for example <img src="bla.gif"> <- bla.gif not found HTML404)
So we use those options to keep html navegating without stoping on first error/problem we use :
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
You can also implement empty classes to stop htmlUnity go verbose on console about css/javaScript errors with:
webClient.setCssErrorHandler(new SilentCssErrorHandler());
webClient.setJavaScriptErrorListener(new JavaScriptErrorListener(){});
The little sample test case:
@Test
public void TestCall() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setUseInsecureSSL(true); //ignore ssl certificate
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
String url = "https://www.wearvr.com/#game_id=game_4";
HtmlPage myPage = webClient.getPage(url);
webClient.waitForBackgroundJavaScriptStartingBefore(200);
webClient.waitForBackgroundJavaScript(20000);
//do stuff on page ex: myPage.getElementById("main")
//myPage.asXml() <- tags and elements
System.out.println(myPage.asText());
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With