
Impossible site for HtmlUnit?

I cannot, for the life of me, rig HtmlUnit up to grab this site:

http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=Bangkok%2C+Thailand+%28BKK%29+-+Suvarnabhumi+International&e=BKK&d1=07%2F30%2F2010&r1=08%2F11%2F2010&p=1&b=COACH&baf=true

I'm sure it has to do with the vast number of scripts running in the background. Perhaps these scripts aren't being given enough time to fully load?

I've also tried simply grabbing bing.com/travel, with no success either. It breaks in the getPage call on the WebClient.

The output gives a plethora of runtimeErrors ("data necessary to complete this operation is not yet available"), all for the same sourceName ("http://www.bing.com/travel/jsxc.vjs?a=common&v=5.5.0-1278007084280")

Then a couple of exceptions are thrown for a missing "(" in a couple of scripts on bing.com.

Then it calls JavaScript and abruptly ends.

I realize this could be a handful of problems that others might not be able to see, and so if there are no suggestions, would someone mind pumping these two sites through a test implementation of their own HtmlUnit use and see if they can get basic output of the XML or text results? I'm not trying to do anything fancy here, just get some basic text or XML output of the results.

It'd be handy to know if someone else's implementation works so I can keep jury-rigging mine to completion.

CODE:

import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.WebClient;

public class test {

    public static void main(String[] args) throws Exception {

        WebClient client = new WebClient();
        System.out.println("webclient loaded");

        HtmlPage currentPage = client.getPage("http://www.bing.com/travel/flight/flightSearch?form=FORMTRVLGENERIC&q=flights+from+SLC+to+BKK+leave+07%2F30%2F2010+return+08%2F11%2F2010+adults%3A1+class%3ACOACH&stoc=0&vo1=Salt+Lake+City%2C+UT+%28SLC%29+-+Salt+Lake+City+International+Airport&o=SLC&ve1=Bangkok%2C+Thailand+%28BKK%29+-+Suvarnabhumi+International&e=BKK&d1=07%2F30%2F2010&r1=08%2F11%2F2010&p=1&b=COACH&baf=true");
        client.waitForBackgroundJavaScript(10000);
        System.out.println("htmlpage init'd");

        //System.out.println(currentPage.getTitleText());
        String textSource = currentPage.asXml();
        System.out.println(textSource);

    }

}

Thanks!

Stu Kalide asked Jul 15 '10 06:07


3 Answers

Try adding this:

client.setThrowExceptionOnScriptError( false ) ;

It takes a long time to run, and boy does it spew out logging... but eventually a page came out:

htmlpage init'd
<?xml version="1.0" encoding="utf-8"?>
<html id="">
  <head>
   ...
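
On the long run time: the question's code uses a fixed ten-second client.waitForBackgroundJavaScript(10000). One alternative worth trying is to poll for a condition and stop as soon as it holds. The following is only a rough sketch of that idea in plain Java, with no HtmlUnit dependency so it runs on its own. The class name PollingWait, the method waitUntil, and the stand-in condition are all made up for illustration; with HtmlUnit the condition would presumably inspect the page, for example by checking currentPage.asXml() for some marker text of your choosing.

```java
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition (an assumption for this demo): true after about 200 ms.
        // With HtmlUnit this might instead be something like
        // () -> currentPage.asXml().contains("some marker")   (hypothetical marker text)
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("condition met: " + ready);
    }
}
```

Whether this actually helps with the Bing page is untested here; it only avoids waiting the full timeout when the content shows up early.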
Rodney Gitzel answered Nov 17 '22 05:11


I also ran into the "data necessary to complete this operation is not yet available" error.
Switching the user agent to "Firefox" helped. See:
http://steveliles.github.com/jquery_htmlunit_runtimeerror_messages_galore.html

Alexander Link answered Nov 17 '22 05:11


Browsers have a high tolerance for what they might detect as errors (in JavaScript, but also in HTML, CSS and so on). This is partly because of the various conflicting "standards" :) for how JavaScript got implemented. Something that appears OK in one browser causes problems in another. So when all these messages are made visible, it can be a little disconcerting.

To put this in perspective: in Internet Explorer, go into your settings, check "Display a notification about every script error" under "Advanced Settings", and then browse the same sites. You might be surprised at how much code IE gets through by simply ignoring what it might detect as problems.

Running HtmlUnit as various browsers just brings some of these conflicts to light.

Telling HtmlUnit to do something like "Ignore...for this browser" is a perfectly valid practice. In my case, I am bringing in data from a site that checks that all its users are on Internet Explorer (no, I have no good idea why they do that), so I can't proceed without ignoring the JavaScript errors. Interestingly, the site works fine even though IE thinks there are lots of JavaScript errors.
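
To make the "ignore and carry on" behaviour concrete, here is a toy model in plain Java. It is not HtmlUnit or browser code, just a sketch of the difference between aborting on the first script error and recording it and continuing. The names TolerantRunner, runAll and throwOnError are invented for this example; throwOnError only loosely mirrors the idea behind HtmlUnit's setThrowExceptionOnScriptError switch.

```java
import java.util.ArrayList;
import java.util.List;

public class TolerantRunner {

    /**
     * Runs each "script" in order. In strict mode the first failure is rethrown;
     * in tolerant mode failures are recorded and the remaining scripts still run.
     * Returns the messages of the failures that were tolerated.
     */
    public static List<String> runAll(List<Runnable> scripts, boolean throwOnError) {
        List<String> errors = new ArrayList<>();
        for (Runnable script : scripts) {
            try {
                script.run();
            } catch (RuntimeException e) {
                if (throwOnError) {
                    throw e; // strict: one bad script aborts the whole page load
                }
                errors.add(e.getMessage()); // tolerant: note the error, keep going
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        List<Runnable> scripts = new ArrayList<>();
        scripts.add(() -> System.out.println("script 1 ran"));
        scripts.add(() -> { throw new IllegalStateException("missing ( in script 2"); });
        scripts.add(() -> System.out.println("script 3 ran"));

        // Tolerant mode: script 3 still runs even though script 2 failed.
        List<String> errors = runAll(scripts, false);
        System.out.println("tolerated errors: " + errors);
    }
}
```

Calling runAll(scripts, true) on the same list would stop at script 2 with an exception, which is roughly the situation described in the question.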

Pete Kelley answered Nov 17 '22 05:11