Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Download html in python?

Tags:

python

html

I am trying to download the html of a page that is requested through a javascript action when you click a link in the browser. I can download the first page because it has a general URL:

http://www.locationary.com/stats/hotzone.jsp?hz=1

But there are links along the bottom of the page that are numbers (1 to 10). So if you click on one, it goes to, for example, page 2:

http://www.locationary.com/stats/hotzone.jsp?ACTION_TOKEN=hotzone_jsp$JspView$NumericAction&inPageNumber=2

When I put that URL into my program and try to download the html, it gives me the html of a different page on the website and I think it is the home page.

How can I get the html of this URL that uses javascript and when there is no specific URL?

Thanks.

Code:

import urllib
import urllib2
import cookielib
import re

URL = ''

def load(url):

    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR'))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction").read()

    h = response.info().headers
    jsid = re.findall(r'Set-Cookie: (.*);', str(h[5]))
    data = urllib.urlencode({"inUserName":"email", "inUserPass":"password"})
    jar = cookielib.FileCookieJar("cookies")
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0.1'))
    opener.addheaders.append(('Referer', 'http://www.locationary.com/'))
    opener.addheaders.append(('Cookie','site_version=REGULAR; ' + str(jsid[0])))
    request = urllib2.Request("https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction", data)
    response = opener.open(request)
    page = opener.open(url).read()
    print page

load(URL)
like image 766
Marcus Johnson Avatar asked Nov 14 '22 00:11

Marcus Johnson


1 Answers

The selenium webdriver from the selenium tool suite uses standard browsers to retrieve the HTML (it's main goal is test automation for web applications), so it is well suited for scrapping javascript-rich applications. It has nice Python bindings.

I tend to use selenium to grab the page source after all ajax stuff is fired and parse it with something like BeautifulSoup (BeautifulSoup copes well with malformed HTML).

like image 164
Paulo Scardine Avatar answered Jan 06 '23 16:01

Paulo Scardine