Python http download page source

Tags: python, http

Hello, I was wondering whether it is possible to connect to an HTTP host (for example, google.com) and download the source of the web page?

Thanks in advance.

DonJuma asked Oct 16 '10 16:10



2 Answers

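You can use urllib2 to download a page. Note that urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request.

Google tries to block requests that look like they come from robots, so a request without a browser-like User-Agent may be rejected. Add a User-Agent header to the request.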

import urllib2  # Python 2 only; in Python 3 use urllib.request instead

# Send a browser-like User-Agent so the request is not rejected as a robot
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = {'User-Agent': user_agent}
req = urllib2.Request('http://www.google.com', None, headers)  # None: no POST data
response = urllib2.urlopen(req)
page = response.read()  # page source as a byte string
response.close()  # it's always safe to close an open connection
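
For Python 3, where urllib2 no longer exists, a rough equivalent of the snippet above might look like the following. This is only a sketch under assumptions: the function names `build_request` and `fetch_page_source`, the default User-Agent string, and the 10-second timeout are made up for illustration and are not part of the original answer.

```python
from urllib.request import Request, urlopen

# Hypothetical default; any browser-like string would do for illustration.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request carrying a User-Agent header (no network access yet)."""
    return Request(url, data=None, headers={"User-Agent": user_agent})


def fetch_page_source(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Download the page and return its source as raw bytes."""
    # The with-block closes the connection even if read() raises.
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()
```

Calling fetch_page_source('http://www.google.com') should then return the page source as bytes, much like page in the urllib2 version.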

You can also use PycURL:

import pycurl

class ContentCallback:
    def __init__(self):
        # Byte-string buffer: pycurl passes each chunk as bytes on Python 3
        self.contents = b''

    def content_callback(self, buf):
        # Called once per received chunk; append it to the buffer
        self.contents = self.contents + buf

t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback)
curlObj.perform()
curlObj.close()
print(t.contents)

pyfunc answered Oct 23 '22 14:10


Using the requests package:

import requests

# URL of the page to download
url = 'https://www.google.com/'

# html is a bytes object containing the HTML source
html = requests.get(url).content

Or with urllib.request from the standard library (Python 3):

from urllib.request import urlopen

# URL of the page to download
url = 'https://www.google.com/'

# html is a bytes object containing the HTML source
html = urlopen(url).read()
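
Both snippets leave html as bytes rather than text. If you want a str, one possible approach is to decode using the charset named in the response's Content-Type header. The helper below is a sketch, not part of the original answer: the name `decode_page` and the UTF-8 fallback are assumptions chosen for illustration.

```python
from email.message import Message


def decode_page(raw_bytes, content_type=None, fallback="utf-8"):
    """Decode downloaded bytes using the charset from a Content-Type header value."""
    charset = None
    if content_type:
        # Message parses header parameters such as "charset=ISO-8859-1".
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    # errors="replace" avoids an exception on bytes that do not fit the charset.
    return raw_bytes.decode(charset or fallback, errors="replace")
```

With urlopen, the header value is available as response.headers.get("Content-Type"), so the response object has to be kept around before calling read(). With requests, the text attribute of the response already gives a decoded string, so a helper like this is probably unnecessary there.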

Orkun Berk Yuzbasioglu answered Oct 23 '22 14:10