
Java Selenium: how can I get the HTML of a webpage without first loading the page?

Using Selenium WebDriver for Java, is it possible to get the HTML of a webpage given a specified URL?

I know that, once a webpage is loaded in a browser, the HTML can be obtained using WebDriver.getPageSource(). However, for improved efficiency, is it possible to obtain the HTML without loading the page in a browser first?

danger mouse asked Jul 24 '16 08:07



2 Answers

You can achieve this using a headless browser.

A headless browser is a web browser without a graphical user interface. It behaves just like a regular browser but does not display any GUI.

Headless browsers are typically used in the following situations:

  • You have a central build tool which does not have any browser installed on it. To run basic sanity tests after every build, you can use a headless browser.

  • You want to write a crawler that goes through different pages and collects data. A headless browser is a good choice here because you don't care about opening a browser window; all you need is access to the webpages.

  • You would like to simulate multiple browser versions on the same machine. In that case a headless browser is useful, because most of them support simulating different browser versions.

Things to pay attention to before using a headless browser

Many headless browsers are simulation programs rather than real browsers. Most of them have evolved enough to approximate a real browser quite closely, but you still would not want to run all your tests in a headless browser. JavaScript is one area where you need to be especially careful. Although JavaScript is a standard, each browser implements it with its own small differences, and the same is true of headless browsers. For example, the HtmlUnit headless browser uses the Rhino JavaScript engine, which is not used by any other browser.

Some examples of headless drivers include:

  • HtmlUnit
  • Ghost
  • PhantomJS
  • ZombieJS
  • Watir-webdriver
Saurabh Gaur answered Oct 23 '22 04:10



Use a plain HTTP request in Java:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sends urlParameters as a form-encoded POST body and returns the response
// body as a string, or null if the request fails. Place inside any class.
public static String executePost(String targetURL, String urlParameters) {
  HttpURLConnection connection = null;

  try {
    // Encode the body once so Content-Length matches the bytes actually sent
    byte[] body = urlParameters.getBytes(StandardCharsets.UTF_8);

    // Create connection
    URL url = new URL(targetURL);
    connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("POST");
    connection.setRequestProperty("Content-Type",
        "application/x-www-form-urlencoded");
    connection.setRequestProperty("Content-Length",
        Integer.toString(body.length));
    connection.setRequestProperty("Content-Language", "en-US");

    connection.setUseCaches(false);
    connection.setDoOutput(true);

    // Send request
    try (OutputStream out = connection.getOutputStream()) {
      out.write(body);
    }

    // Get response (assumes the server replies in UTF-8)
    StringBuilder response = new StringBuilder();
    try (BufferedReader rd = new BufferedReader(new InputStreamReader(
        connection.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = rd.readLine()) != null) {
        response.append(line);
        response.append('\n');
      }
    }
    return response.toString();
  } catch (Exception e) {
    e.printStackTrace();
    return null;
  } finally {
    if (connection != null) {
      connection.disconnect();
    }
  }
}
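
The method above sends a POST, but fetching a page's HTML is normally a GET. As a rough sketch only, assuming Java 11 or later, the same idea could be written with the built-in java.net.http.HttpClient. The class name HtmlFetcher and the method fetchHtml are made up for illustration. Note that this returns the markup exactly as the server sends it: no JavaScript runs, so the result can differ from what WebDriver.getPageSource() returns for pages that build their content in the browser.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class, not part of Selenium or of the answer above.
class HtmlFetcher {

    // Fetches the raw HTML at the given URL with a plain HTTP GET.
    // No browser is started and no JavaScript is executed.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow ordinary redirects
                .connectTimeout(Duration.ofSeconds(10))      // timeout values are arbitrary
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();

        // Read the whole response body as a String
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 200) {
            throw new IOException(
                    "Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

A call such as HtmlFetcher.fetchHtml("https://example.com") (a placeholder URL) would return the page source as a String. If the page relies on JavaScript to render its content, a headless browser is still the safer option.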
Leon Barkan answered Oct 23 '22 04:10
