How to save "complete webpage" not just basic html using Python

Try emulating your browser with Selenium. The script below brings up the browser's Save As dialog for the page. You will still have to work out how to emulate pressing Enter to start the download, because the file dialog is outside Selenium's reach (how you do that is OS-dependent).

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Firefox()
br.get('http://www.google.com/')

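# Send Ctrl+S to the browser. key_down/key_up are meant for modifier
# keys only, so the plain 's' key goes through send_keys instead.
save_me = ActionChains(br).key_down(Keys.CONTROL)\
         .send_keys('s').key_up(Keys.CONTROL)
save_me.perform()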

Also, following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think Selenium is a good starting point, since br.page_source gives you the entire DOM, including the dynamic content generated by JavaScript.
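As a rough illustration of the "grab the linked resources" idea, here is a minimal sketch that uses only the standard library to list the asset URLs a page refers to. The names AssetCollector and collect_asset_urls are hypothetical, the sample HTML and example.com URLs are made up, and the sketch only looks at img, script and link tags; it does not download anything or rewrite the HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect URLs of resources referenced by img, script and link tags."""

    # Tag name -> attribute holding the resource URL (an illustrative subset;
    # note that <link> also covers non-stylesheet links such as icons).
    URL_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the page URL.
                self.assets.append(urljoin(self.base_url, value))


def collect_asset_urls(html, base_url):
    """Return absolute URLs of the assets found in the given HTML string."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets


# Made-up sample page; with Selenium you could pass br.page_source instead.
sample = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="js/app.js"></script>
</head><body><img src="http://cdn.example.com/logo.png"></body></html>
"""

for asset in collect_asset_urls(sample, "http://example.com/docs/page.html"):
    print(asset)
```

Each collected URL could then be fetched and written next to the saved HTML; resources referenced from inside CSS files or loaded by scripts would need extra handling.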

You can do that easily with pywebcopy, a simple Python library.

For the current version, 5.0.1:

from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will have the HTML, CSS, and JS all in your download_folder, working just like the original site.

To get the script above by @rajatomar788 to run, I first had to install all of the following packages:

pip install pywebcopy 
pip install pyquery
pip install w3lib
pip install parse 
pip install lxml

After that it worked with a few errors, but I did get the folder filled with the files that make up the webpage.

webpage    - INFO     - Starting save_assets Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
webpage    - Level 100 - Queueing download of <89> asset files.
Exception in thread <Element(LinkTag, file:///++resource++images/favicon2.ico)>:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 312, in run
    super(LinkTag, self).run()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 58, in run
    self.download_file()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\elements.py", line 107, in download_file
    req = SESSION.get(url, stream=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pywebcopy\configs.py", line 244, in get
    return super(AccessAwareSession, self).get(url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///++resource++images/favicon2.ico'

webpage    - INFO     - Starting save_html Action on url: 'http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html'
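The InvalidSchema error in the log comes from requests being handed a file:/// URL; by default it only has connection adapters for http:// and https://. If you write your own download loop, one way to avoid this is to filter URLs by scheme first. The sketch below is not part of pywebcopy, and split_by_scheme is a hypothetical helper; the two URLs are taken from the log above.

```python
from urllib.parse import urlparse

# Schemes that requests can fetch with its default adapters.
DOWNLOADABLE_SCHEMES = {"http", "https"}


def split_by_scheme(urls):
    """Split URLs into (downloadable, skipped) lists based on their scheme."""
    downloadable, skipped = [], []
    for url in urls:
        scheme = urlparse(url).scheme.lower()
        if scheme in DOWNLOADABLE_SCHEMES:
            downloadable.append(url)
        else:
            skipped.append(url)
    return downloadable, skipped


urls = [
    "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html",
    "file:///++resource++images/favicon2.ico",
]
ok, skipped = split_by_scheme(urls)
print("download:", ok)
print("skip:", skipped)
```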
