Hello there, I was wondering if it is possible to connect to an HTTP host (e.g. google.com) and download the source of the web page?
Thanks in advance.
To download a web page's HTML source manually, navigate to the page in your browser and select Save Page As from the File menu. You'll then be prompted to choose whether to save the complete page (including images) or just the HTML source. Most major browsers offer a similar option.
Using urllib2 (Python 2) to download a page:
Google may block a plain request, since it tries to block most robots, so add a User-Agent header to the request:
import urllib2
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib2.Request('http://www.google.com', None, headers)
response = urllib2.urlopen(req)
page = response.read()
response.close() # it's always good practice to close the connection when you're done
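Note that urllib2 only exists in Python 2. As a minimal sketch, assuming Python 3, the equivalent with urllib.request from the standard library would look something like this:
from urllib.request import Request, urlopen
user_agent = 'Mozilla/5.0'                 # any realistic browser string; reuse the one above if you like
headers = {'User-Agent': user_agent}
req = Request('http://www.google.com', headers=headers)
with urlopen(req) as response:             # the with-block closes the connection for us
    page = response.read()                 # the HTML source as bytes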
You can also use PycURL (this example is Python 2):
import sys
import pycurl
class ContentCallback:
    def __init__(self):
        self.contents = ''

    def content_callback(self, buf):
        self.contents = self.contents + buf
t = ContentCallback()
curlObj = pycurl.Curl()
curlObj.setopt(curlObj.URL, 'http://www.google.com')
curlObj.setopt(curlObj.WRITEFUNCTION, t.content_callback) # called with each chunk of the response body
curlObj.perform()
curlObj.close()
print t.contents
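With a reasonably recent pycurl you can skip the callback class and let pycurl write into an io.BytesIO buffer via WRITEDATA; a minimal sketch (Python 3):
import pycurl
from io import BytesIO

buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, 'http://www.google.com')
curl.setopt(curl.WRITEDATA, buffer)   # pycurl writes the response body into the buffer
curl.perform()
curl.close()

print(buffer.getvalue().decode('iso-8859-1'))   # bytes -> str; pick the charset the page actually uses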
Using the requests package:
import requests
# URL of the page to fetch
url = 'https://www.google.com/'
# html is a byte string containing the HTML source
html = requests.get(url).content
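As a sketch of slightly more defensive usage, you can check the HTTP status and let requests decode the body to a str for you (.content is raw bytes, .text is the decoded string):
import requests

response = requests.get('https://www.google.com/')
response.raise_for_status()        # raises an HTTPError for 4xx/5xx responses
html_bytes = response.content      # raw bytes
html_text = response.text          # str, decoded using the encoding requests detected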
or with urllib from the standard library (Python 3):
from urllib.request import urlopen
# URL of the page to fetch
url = 'https://www.google.com/'
# html is a byte string containing the HTML source
html = urlopen(url).read()
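urlopen(url).read() gives you bytes; a small sketch of decoding it to a str, assuming the server declares a charset in its Content-Type header (falling back to UTF-8 otherwise):
from urllib.request import urlopen

with urlopen('https://www.google.com/') as response:
    charset = response.headers.get_content_charset() or 'utf-8'
    html = response.read().decode(charset)   # the HTML source as a str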