I run a query in one web page, then I get result url. If I right click see html source, I can see the html code generated by JS. If I simply use urllib, python cannot get the JS code. So I see some solution using selenium. Here's my code:
from selenium import webdriver
url = 'http://www.archives.com/member/Default.aspx?_act=VitalSearchResult&lastName=Smith&state=UT&country=US&deathYear=2004&deathYearSpan=10&location=UT&activityID=9b79d578-b2a7-4665-9021-b104999cf031&RecordType=2'
driver = webdriver.PhantomJS(executable_path='C:\python27\scripts\phantomjs.exe')
driver.get(url)
print driver.page_source
>>> <html><head></head><body></body></html> Obviously It's not right!!
Here's the source code I need in right click windows, (I want the INFORMATION part)
</script></div><div class="searchColRight"><div id="topActions" class="clearfix
noPrint"><div id="breadcrumbs" class="left"><a title="Results Summary"
href="Default.aspx? _act=VitalSearchR ...... <<INFORMATION I NEED>> ...
to view the entire record.</p></div><script xmlns:msxsl="urn:schemas-microsoft-com:xslt">
jQuery(document).ready(function() {
jQuery(".ancestry-information-tooltip").actooltip({
href: "#AncestryInformationTooltip", orientation: "bottomleft"});
});
So my question is: How to get the information generated by JS?
We can get HTML with JavaScript rendered source code by using Selenium webdriver. Selenium can execute JavaScript commands with the help of the executeScript method.
There are 2 ways to get the HTML source of a web element using Selenium: Method #1 – Read the innerHTML attribute to get the source of the content of the element. innerHTML is a property of a DOM element whose value is the HTML that exists in between the opening tag and ending tag.
You will need to get get the document via javascript
you can use seleniums execute_script
function
from time import sleep # this should go at the top of the file
sleep(5)
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print html
That will get everything inside of the <html>
tag
It's not necessary to use that workaround, you can use instead:
driver = webdriver.PhantomJS()
driver.get('http://www.google.com/')
html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With