How to download a full webpage with a Python script?

Tags: python, request, beautifulsoup

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files of the web page, including the HTML, CSS, JS and image files (the same result as saving a page with Ctrl+S in a browser).

My current code is:

import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")
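
For readers on Python 3: the snippet above uses the Python 2 `urllib` API. A minimal sketch of the same HTML-only download, assuming Python 3, where `urlretrieve` lives in `urllib.request` (the helper name `save_html` is hypothetical):

```python
from urllib.request import urlretrieve


def save_html(url, filename):
    """Download the document at `url` to `filename` and return the local path."""
    # urlretrieve returns a (local_path, headers) tuple; only the path is kept.
    path, _headers = urlretrieve(url, filename)
    return path
```

It would be called as `save_html(url, "t3.html")` and, like the original, fetches only the HTML document itself.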

I looked at many related questions, but all of them download only the HTML.

asked Jul 03 '15 11:07

Rahul Satal

2 Answers

The following implementation collects the URLs of the pages linked from the main page (the sub-pages). It can be developed further to fetch the other files you need. I added the depth parameter so you can set how many levels of sub-pages to follow.

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2 only)
from urlparse import urljoin


def crawl(pages, depth=1):
    indexed_url = []  # the main page and every sub-page URL found so far
    crawled = set()   # pages that have already been fetched
    for i in range(depth):
        found = []  # links discovered on this pass
        for page in pages:
            if page in crawled:
                continue
            crawled.add(page)
            if page not in indexed_url:
                indexed_url.append(page)
            try:
                c = urllib2.urlopen(page)
            except Exception:
                print "Could not open %s" % page
                continue
            soup = BeautifulSoup(c.read())
            links = soup('a')  # find all the sub-links
            for link in links:
                if 'href' in dict(link.attrs):
                    url = urljoin(page, link['href'])
                    if url.find("'") != -1:
                        continue
                    url = url.split('#')[0]  # drop the fragment
                    if url[0:4] == 'http' and url not in indexed_url:
                        indexed_url.append(url)
                        found.append(url)
        pages = found  # follow the newly found links on the next pass
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls

Python 3 version (2019), in case it saves someone some time:

#!/usr/bin/env python3

from urllib.request import urlopen
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(pages, depth=1):
    indexed_url = []  # the main page and every sub-page URL found so far
    crawled = set()   # pages that have already been fetched
    for _ in range(depth):
        found = []  # links discovered on this pass
        for page in pages:
            if page in crawled:
                continue
            crawled.add(page)
            if page not in indexed_url:
                indexed_url.append(page)
            try:
                c = urlopen(page)
            except Exception:
                print("Could not open %s" % page)
                continue
            # Name the parser explicitly so bs4 does not have to guess one.
            soup = BeautifulSoup(c.read(), "html.parser")
            for link in soup.find_all('a', href=True):  # all links with an href
                url = urljoin(page, link['href'])
                if "'" in url:
                    continue
                url = url.split('#')[0]  # drop the fragment
                if url.startswith('http') and url not in indexed_url:
                    indexed_url.append(url)
                    found.append(url)
        pages = found  # follow the newly found links on the next pass
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
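
The crawler above gathers only page links. As one possible extension toward the other files (images, scripts and stylesheets), here is a minimal sketch under stated assumptions: the names `AssetCollector`, `find_assets` and `download_assets` are hypothetical, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies. It does not rewrite the links inside the saved HTML, so it is not a full equivalent of Ctrl+S.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        ref = None
        if tag in ("img", "script"):
            ref = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower().split():
            ref = attrs.get("href")
        if ref:
            # Resolve relative references against the page URL.
            url = urljoin(self.base_url, ref)
            if url.startswith("http") and url not in self.assets:
                self.assets.append(url)


def find_assets(html, base_url):
    """Return the asset URLs referenced by an HTML string, in document order."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    return collector.assets


def download_assets(asset_urls, folder):
    """Save each asset into `folder`; a failed download is reported and skipped."""
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(asset_urls):
        # Prefix with an index so files with the same basename do not collide.
        name = os.path.basename(urlparse(url).path) or "asset"
        target = os.path.join(folder, "%03d_%s" % (i, name))
        try:
            urlretrieve(url, target)
        except OSError as err:
            print("Could not download %s: %s" % (url, err))
```

A caller would read the page's HTML, decode it to a string, pass it to `find_assets` together with the page URL, and hand the resulting list to `download_assets`.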

answered Oct 23 '22 10:10

Sam Al-Ghammari

You can do that easily with pywebcopy, a simple Python library.

For the current version, 5.0.1:


from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will get the HTML, CSS and JS files in your download_folder, and the saved copy works like the original site.
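
To check what actually landed in the download folder, the saved files can be counted by extension. This is a minimal sketch using only the standard library; `count_saved_files` is a hypothetical helper and is not part of pywebcopy.

```python
import os
from collections import Counter


def count_saved_files(folder):
    """Count the files under `folder` by lowercase extension."""
    counts = Counter()
    for _root, _dirs, files in os.walk(folder):
        for name in files:
            # Files without an extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return dict(counts)
```

Calling `count_saved_files(download_folder)` after the download gives a quick view of how many HTML, CSS, JS and image files were saved.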

answered Oct 23 '22 11:10

rajatomar788
