
Selenium - driver.getPageSource() differs from the source viewed in the browser

I am trying to save the source of a given URL to an HTML file using Selenium, but the result does not exactly match the source I see in the browser, and I don't know why.

Below is my Java code to capture the source in an HTML file:

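import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

private static void getHTMLSourceFromURL(String url, String fileName) {

    WebDriver driver = new FirefoxDriver();
    driver.get(url);

    try {
        Thread.sleep(5000);   // fixed wait so the page can finish loading

        // getPageSource() returns one string; split it into lines for writing
        List<String> pageSource = new ArrayList<String>(Arrays.asList(driver.getPageSource().split("\n")));

        writeTextToFile(pageSource, fileName);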

    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    System.out.println("quitting webdriver");
    driver.quit();
}

/**
 * creates file with fileName and writes the content
 * 
 * @param content
 * @param fileName
 */
private static void writeTextToFile(List<String> content, String fileName) {
    PrintWriter pw = null;
    String outputFolder = ".";
    File output = null;
    try {
        File dir = new File(outputFolder + '/' + "HTML Sources");
        if (!dir.exists()) {
            boolean success = dir.mkdirs();
            if (!success) {
                // handled by the outer catch block, which prints the stack trace
                throw new IOException(dir + " could not be created");
            }
        }

        output = new File(dir + "/" + fileName);
        if (!output.exists()) {
            try {
                output.createNewFile();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
        pw = new PrintWriter(new FileWriter(output, true));
        for (String line : content) {
            pw.print(line);
            pw.print("\n");
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        // pw is still null if the FileWriter could not be opened
        if (pw != null) {
            pw.close();
        }
    }

}

Can someone shed some light on why this happens? How does WebDriver render the page, and how does the browser show the source?

roger_that asked Oct 14 '13 10:10




3 Answers

There are several places you can get the source from. You can try

String pageSource = driver.findElement(By.tagName("body")).getText();

and see what comes up. Note that getText() returns the element's visible text, not its markup.

Generally you do not need to wait for the page to load. Selenium does that automatically, unless parts of the page are loaded separately by JavaScript/Ajax.

It would help if you described the differences you are seeing, so that we can understand what you really mean.

WebDriver does not render the page on its own; it returns the page as the browser sees it.
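For pages that keep changing through JavaScript/Ajax, one alternative to a fixed Thread.sleep(5000) is to poll the source until two consecutive reads match. The following is only a sketch under assumptions: StableSource and waitUntilStable are hypothetical names, not part of Selenium, and the code uses only the Java standard library. With a real driver you would pass driver::getPageSource as the supplier.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of Selenium.
class StableSource {

    // Polls `source` until two consecutive reads are equal or the timeout
    // elapses, then returns the most recent read.
    static String waitUntilStable(Supplier<String> source, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String previous = source.get();
        while (System.nanoTime() < deadline) {
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return previous;
            }
            String current = source.get();
            if (current.equals(previous)) {
                return current; // nothing changed between two reads
            }
            previous = current;
        }
        return previous; // timed out; return the last snapshot
    }

    public static void main(String[] args) {
        // Fake page that gains one paragraph and then stops changing.
        Iterator<String> snapshots = List.of(
                "<p>1</p>", "<p>1</p><p>2</p>", "<p>1</p><p>2</p>").iterator();
        String stable = waitUntilStable(snapshots::next,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(stable);
    }
}
```

Two equal reads do not prove that the page has finished loading; a slow Ajax call can still arrive later, so the poll interval is a judgment call.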

Madusudanan answered Sep 27 '22 01:09


I encountered the same problem and used this code to solve it:

......
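// Ask the browser for the innerHTML of the <html> element
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver)
    .executeScript(javascript, driver.findElement(By.tagName("html")));
// innerHTML excludes the element's own tags, so wrap the result again
pageSource = "<html>" + pageSource + "</html>";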
System.out.println(pageSource);
//FileUtils.write(new File("e:\\test.html"), pageSource,);
......

Reading the innerHTML property through JavaScript finally worked, and the question marks disappeared.

mikemelon answered Sep 25 '22 01:09


The "source" you get from Selenium does not seem to be the original source at all; it appears to be HTML serialized from the current DOM. The source you see in the browser (View Source) is the HTML as delivered by the server, before JavaScript makes any dynamic changes to it. If the DOM changes at all, the browser's source view does not reflect those changes, but Selenium's output will. To see the current DOM in a browser, use the developer tools rather than View Source.
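To check this on a given page, you could fetch the HTML as the server delivers it and compare it with what driver.getPageSource() returns. The following is a minimal sketch, assuming Java 11+ for java.net.http; the class and method names (ServedVersusDom, fetchServedHtml, firstDifferingLine) are made up for illustration. A server may also respond differently to a plain HTTP client than to a real browser (cookies, user agent), so treat the comparison as a hint only.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for comparing served HTML with the DOM snapshot.
class ServedVersusDom {

    // Fetches the HTML as sent by the server; no JavaScript is executed.
    static String fetchServedHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Returns the 1-based number of the first line where the two texts
    // differ, or -1 if they are identical.
    static int firstDifferingLine(String served, String dom) {
        String[] a = served.split("\n", -1);
        String[] b = dom.split("\n", -1);
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (!a[i].equals(b[i])) {
                return i + 1;
            }
        }
        // Same prefix: either identical, or one text has extra lines.
        return a.length == b.length ? -1 : common + 1;
    }

    public static void main(String[] args) {
        // In-memory stand-ins; with a real page, `served` would come from
        // fetchServedHtml(url) and `dom` from driver.getPageSource().
        String served = "<html>\n<body>\n<div id=\"app\"></div>\n</body>\n</html>";
        String dom = "<html>\n<body>\n<div id=\"app\"><p>Hello</p></div>\n</body>\n</html>";
        System.out.println("First differing line: " + firstDifferingLine(served, dom));
    }
}
```

With the stand-in strings, the first difference is on line 3, where a script would have filled in the div.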

Indigenuity answered Sep 27 '22 01:09