Download HTML in Python?

Question

I am trying to download the HTML of a page that is requested through a JavaScript action when you click a link in the browser. I can download the first page because it has a plain URL:

http://www.locationary.com/stats/hotzone.jsp?hz=1

But along the bottom of the page there are numbered links (1 to 10). Clicking one takes you to, for example, page 2:

http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2

When I put that URL into my program and try to download the HTML, I get the HTML of a different page on the website, which I think is the home page.

How can I get the HTML of a page that is loaded through JavaScript and has no specific URL of its own?

Thanks.

Code:

import urllib
import urllib2
import cookielib
import re

URL = ''

def load(url):
    # First login: POST the credentials and keep the response headers.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()

    # Pull the cookie value out of the login response.
    # Note: h[5] assumes Set-Cookie is always the sixth response header.
    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))

    # Second login: a fresh opener that sends the captured cookie explicitly.
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)

    # Fetch the requested page with the second opener and print its HTML.
    page = opener.open(url).read()
    print page

load(URL)

Paulo Scardine · Accepted Answer

The Selenium WebDriver from the Selenium tool suite uses standard browsers to retrieve the HTML (its main goal is test automation for web applications), so it is well suited for scraping JavaScript-rich applications. It has nice Python bindings.

I tend to use Selenium to grab the page source after all the AJAX requests have fired, then parse it with something like BeautifulSoup (which copes well with malformed HTML).
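A minimal sketch of that workflow, written for Python 3 and assuming the `selenium` and `beautifulsoup4` packages plus a local Firefox install. The function names (`rendered_html`, `link_targets`, `main`) are made up for illustration, and the guess that the page-2 link has the visible text "2" comes only from the question's description. Logging in to the site is not handled here.

```python
def rendered_html(driver, url):
    """Load `url` in the browser behind `driver` and return its current HTML.

    `driver` is any Selenium WebDriver instance. Content fetched by slow
    AJAX calls may need an explicit wait before reading page_source.
    """
    driver.get(url)
    return driver.page_source


def link_targets(html):
    """Return the href of every <a> tag in `html`, in document order."""
    # Imported here so the module loads even without the package installed.
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


def main():
    # Third-party: selenium. Firefox is an assumption; any driver would do.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        first = rendered_html(
            driver, "http://www.locationary.com/stats/hotzone.jsp?hz=1"
        )
        print(link_targets(first))

        # Assumption: the pager link for page 2 has the visible text "2".
        driver.find_element(By.LINK_TEXT, "2").click()
        print(driver.page_source)
    finally:
        driver.quit()
```

Calling `main()` opens a real browser window, so the click on the pager link goes through the same JavaScript action a user would trigger; the program never has to construct the `ACTION_TOKEN` URL itself.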

Tags: python, html

Marcus Johnson
