I am trying to download the html of a page that is requested through a javascript action when you click a link in the browser. I can download the first page because it has a general URL:
http://www.locationary.com/stats/hotzone.jsp?hz=1
But there are links along the bottom of the page that are numbers (1 to 10). So if you click on one, it goes to, for example, page 2:
http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2
When I put that URL into my program and try to download the HTML, I get the HTML of a different page on the website, which I think is the home page.
How can I get the HTML of a page that is loaded through a JavaScript action and has no specific URL of its own?
Thanks.
Code:
import urllib
import urllib2
import cookielib
import re

URL = ''

def load(url):
    # First login request: used only to read the session cookie from the response headers.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()

    # Pull the cookie value out of the sixth response header (assumed to be Set-Cookie).
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))

    # Second login request: same as above, but sends the extracted cookie as well.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)

    # Fetch the target page with the logged-in opener and print its HTML.
    page = opener.open(url).read()
    print page

load(URL)
The Selenium WebDriver from the Selenium tool suite uses standard browsers to retrieve the HTML (its main goal is test automation for web applications), so it is well suited for scraping JavaScript-rich applications. It has nice Python bindings.
I tend to use Selenium to grab the page source after all the AJAX requests have fired and then parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).
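Something along these lines might work as a starting point. It is only a rough sketch, assuming Python 3, Selenium 4 with Firefox and its driver installed, and the beautifulsoup4 package. The function names are made up for illustration, the pager link is assumed to have the page number as its visible text, and logging in to the site is not handled here.

```python
START_URL = "http://www.locationary.com/stats/hotzone.jsp?hz=1"  # first page, from the question


def html_after_page_click(driver, page_number, start_url=START_URL):
    """Load the first page, click the numbered pager link, return the rendered HTML."""
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium.webdriver.common.by import By

    driver.get(start_url)
    # Assumption: the pager link's visible text is just the page number.
    driver.find_element(By.LINK_TEXT, str(page_number)).click()
    # A slow page may need an explicit wait here before the source is read.
    return driver.page_source


def link_targets(html):
    """Return the href of every link in the HTML, parsed with BeautifulSoup."""
    from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def main():
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumes Firefox and its driver are available
    try:
        html = html_after_page_click(driver, 2)
    finally:
        driver.quit()  # always close the browser, even if the click fails
    for href in link_targets(html):
        print(href)
```

Calling main() opens a browser, clicks through to page 2 in the same session and prints the link targets found there. Since the click happens in a real browser, the JavaScript action runs exactly as it would for a user, so you never need to build the ACTION_TOKEN URL yourself.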