I am trying to save the source code of a given URL to an HTML file using Selenium, but the output does not match the source I see in the browser, and I don't know why.
Below is my Java code that captures the source in an HTML file:
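import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

private static void getHTMLSourceFromURL(String url, String fileName) {
    WebDriver driver = new FirefoxDriver();
    driver.get(url);
    try {
        Thread.sleep(5000); // crude fixed wait so the page can finish loading
        List<String> pageSource = new ArrayList<String>(Arrays.asList(driver.getPageSource().split("\n")));
        writeTextToFile(pageSource, fileName); // was 'originalFile', which is not defined here
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("quitting webdriver");
    driver.quit();
}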
/**
 * Creates a file named fileName under "./HTML Sources" and appends the content to it.
 *
 * @param content  lines to write
 * @param fileName name of the output file
 */
private static void writeTextToFile(List<String> content, String fileName) {
    PrintWriter pw = null;
    String outputFolder = ".";
    try {
        File dir = new File(outputFolder + '/' + "HTML Sources");
        if (!dir.exists() && !dir.mkdirs()) {
            // Report the failure; the FileWriter below will then throw an IOException.
            System.err.println(dir + " could not be created");
        }
        File output = new File(dir, fileName);
        // FileWriter creates the file if it is missing; 'true' appends to an existing file.
        pw = new PrintWriter(new FileWriter(output, true));
        for (String line : content) {
            pw.print(line);
            pw.print("\n");
        }
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        // pw is still null if the FileWriter constructor threw.
        if (pw != null) {
            pw.close();
        }
    }
}
Can someone shed some light on why this happens? How does WebDriver render the page, and how does the browser show the source?
We can get the page source with Selenium WebDriver's getPageSource method, which returns the source of the currently loaded page.
Selenium WebDriver provides two methods to navigate to a URL: driver.get() and driver.navigate().to().
getPageSource() is a method of the WebDriver interface, so driver.getPageSource() returns the source code of the page as a String. contains is a method of the String class that checks whether one string contains another.
To verify a title, we use the getTitle() method to get the actual title of a web page. We store the title in a String and then use an assertion to compare it with the expected title. An if statement can also be used to compare the actual and expected titles.
There are several places where you can get the source from. You can try
String pageSource=driver.findElement(By.tagName("body")).getText();
and see what comes up. Note that getText() returns only the visible text of the body, not its HTML markup.
Generally you do not need to wait for the page to load. Selenium does that automatically, unless parts of the page are loaded separately by JavaScript/Ajax.
You might want to add the differences you are seeing, so that we can understand what you really mean.
WebDriver does not render the page on its own; it drives a real browser, so what you get is the page as that browser sees it.
I encountered the same problem. I used this code to solve it:
......
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver)
.executeScript(javascript, driver.findElement(By.tagName("html")));
pageSource = "<html>"+pageSource +"</html>";
System.out.println(pageSource);
//FileUtils.write(new File("e:\\test.html"), pageSource,);
......
Using JavaScript to read the innerHTML property finally worked, and the question marks disappeared.
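Stray question marks in a saved file can also come from the writing step rather than from Selenium: FileWriter, as used in the question, encodes with the platform default charset and substitutes '?' for characters that charset cannot represent. Whether that was the cause here is not confirmed by this thread. As a minimal sketch under that assumption, the page source could be written with an explicit UTF-8 encoding. The class name PageSourceWriter and the method writeUtf8 are hypothetical, introduced only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Selenium or of the code posted above.
public class PageSourceWriter {

    /** Writes html to dir/fileName as UTF-8, creating dir first if it is missing. */
    public static Path writeUtf8(String html, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        // Explicit charset: the bytes on disk no longer depend on the platform default.
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Sample markup with non-ASCII characters; in real use this would be driver.getPageSource().
        String html = "<html><body>caf\u00e9 \u2713</body></html>";
        Path written = writeUtf8(html, Files.createTempDirectory("html-sources"), "test.html");
        String readBack = new String(Files.readAllBytes(written), StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + readBack.equals(html));
    }
}
```

Unlike the FileWriter call in the question, this overwrites the target file instead of appending to it, so repeated runs do not accumulate several copies of the page in one file.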
The "source" code you get from Selenium seems to not be the source at all. It seems to be the HTML for the current DOM. The source code you see in the browser is the HTML as given by the server, before any dynamic changes made to it by JavaScript. If the DOM changes at all, the browser source code doesn't reflect those changes, but Selenium will. If you want to see the current DOM in a browser, you'd use the developer tools, not the source code.
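If what you actually want is the HTML as given by the server, one option is to skip the browser and request the URL directly. The sketch below assumes Java 11 or later and uses the standard java.net.http client; the class name RawSourceFetcher and the placeholder URL are hypothetical. Because no JavaScript runs, the result corresponds to the unmodified source rather than the current DOM. A server may still send different markup to a plain HTTP client than to a browser (different headers, cookies, or user agent), so an exact match with the browser's view is not guaranteed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for illustration; it does not use Selenium at all.
public class RawSourceFetcher {

    /** Returns the response body as sent by the server; no scripts are executed. */
    public static String fetchRawSource(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // https://example.com is a placeholder, not a URL from the question.
        String url = args.length > 0 ? args[0] : "https://example.com";
        System.out.println(fetchRawSource(url));
    }
}
```

Comparing this output with driver.getPageSource() for the same URL is a quick way to see which differences come from JavaScript changing the DOM after the page loads.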