Save complete web page (incl css, images) using python/selenium


I am using Python/Selenium to submit genetic sequences to an online database, and want to save the full page of results I get back. Below is the code that gets me to the results I want:

    import time

    from selenium import webdriver

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'
    CHROME_WEBDRIVER_LOCATION = '/home/max/Downloads/chromedriver' # update this for your machine

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome(executable_path=CHROME_WEBDRIVER_LOCATION)
    driver.get(URL)
    time.sleep(5)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()
    time.sleep(60)

At that point I have a page that I can manually click "save as," and get a local file (with a corresponding folder of image/js assets) that lets me view the whole returned page locally (minus content which is generated dynamically from scrolling down the page, which is fine). I assumed there would be a simple way to mimic this 'save as' function in python/selenium but haven't found one. The code to save the page below just saves html, and does not leave me with a local file that looks like it does in the web browser, with images, etc.

    content = driver.page_source
    with open('webpage.html', 'w') as f:
        f.write(content)

I've also found this question/answer on SO, but the accepted answer only brings up the 'save as' dialog and does not provide a way to click it (as two commenters point out).

Is there a simple way to 'save [full page] as' using python? Ideally I'd prefer an answer using selenium since selenium makes the crawling part so straightforward, but I'm open to using another library if there's a better tool for this job. Or maybe I just need to specify all of the images/tables I want to download in code, and there is no shortcut to emulating the right-click 'save as' functionality?

UPDATE: follow-up question on James' answer

I ran James' code to generate page.html (and its associated files) and compared it to the HTML file I got from manually clicking "save as". The page.html saved by James' script is great and has everything I need, but when opened in a browser it also shows a lot of extra formatting text that is hidden in the manually saved page. See the attached screenshot (manually saved page on the left, script-saved page with the extra formatting text on the right).

This is especially surprising to me because the raw HTML of the page saved by James' script seems to indicate those fields should still be hidden. See e.g. the HTML below, which appears the same in both files, yet the text at issue is only rendered in the browser for the page saved by James' script:

    <p class="helpbox ui-ncbitoggler-slave ui-ncbitoggler" id="hlp1" aria-hidden="true">
    These options control formatting of alignments in results pages. The default is
    HTML, but other formats (including plain text) are available. PSSM and
    PssmWithParameters are representations of Position Specific Scoring Matrices and
    are only available for PSI-BLAST.
    The Advanced view option allows the database descriptions to be sorted by various
    indices in a table.
    </p>

Any idea why this is happening?


asked Dec 11 '18 17:12

Max Power

2 Answers

As you noted, Selenium cannot interact with the browser's context menu to use Save as..., so you could instead use an external automation library such as pyautogui.

    import time
    import pyautogui

    # driver and SEQUENCE come from the question's code
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')

This code opens the Save as... window through its keyboard shortcut CTRL+S and then saves the webpage and its assets into the default downloads location by pressing enter. This code also names the file as the sequence in order to give it a unique name, though you could change this for your use case. If needed, you could additionally change the download location through some extra work with the tab and arrow keys.
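If the raw sequence makes for an unwieldy file name, one option is to derive a shorter name from it. The sketch below is only an illustration: `result_filename` is a hypothetical helper (not part of pyautogui or Selenium), and the prefix and digest lengths are arbitrary assumptions.

```python
import hashlib


def result_filename(sequence, prefix_len=12, digest_len=10):
    """Build a short, deterministic file name for a query sequence.

    Hypothetical helper: keeps the first few bases for readability and
    appends part of a SHA-1 digest so different sequences get different names.
    """
    digest = hashlib.sha1(sequence.encode("ascii")).hexdigest()[:digest_len]
    return f"{sequence[:prefix_len]}_{digest}.html"
```

You could then pass `result_filename(SEQUENCE)` to `pyautogui.typewrite` instead of `SEQUENCE + '.html'`.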

Tested on Ubuntu 18.10; depending on your OS you may need to modify the key combination sent.
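As a rough sketch of how the key combination could be chosen per OS: the helper names `save_modifier` and `save_page_as` below are hypothetical, and the assumption is that the browser uses Command+S on macOS and Ctrl+S elsewhere. I have only checked the Ctrl+S variant, so treat the macOS branch as untested.

```python
import sys
import time


def save_modifier(platform=sys.platform):
    """Return the pyautogui key name of the modifier used for 'Save as...'.

    Assumption: Command on macOS ('darwin'), Ctrl on every other platform.
    """
    return "command" if platform == "darwin" else "ctrl"


def save_page_as(filename, dialog_delay=1.0):
    """Open the browser's save dialog, type a file name and confirm it."""
    import pyautogui  # imported lazily because it needs a graphical session

    pyautogui.hotkey(save_modifier(), "s")
    time.sleep(dialog_delay)  # give the dialog time to appear
    pyautogui.typewrite(filename)
    pyautogui.press("enter")
```

With this in place, the last four lines of the snippet above would become `save_page_as(SEQUENCE + '.html')`. The browser window still has to be focused for the keystrokes to land in it.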

Full code, in which I also added conditional waits to improve speed:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.expected_conditions import visibility_of_element_located
    from selenium.webdriver.support.ui import WebDriverWait
    import pyautogui

    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA' #'GAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGA'

    # open page with selenium
    # (first need to download Chrome webdriver, or a firefox webdriver, etc)
    driver = webdriver.Chrome()
    driver.get(URL)

    # enter sequence into the query field and hit 'blast' button to search
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)

    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    # wait until results are loaded
    WebDriverWait(driver, 60).until(visibility_of_element_located((By.ID, 'grView')))

    # open 'Save as...' to save html and assets
    pyautogui.hotkey('ctrl', 's')
    time.sleep(1)
    pyautogui.typewrite(SEQUENCE + '.html')
    pyautogui.hotkey('enter')

answered Sep 20 '22 15:09

FThompson

This is not a perfect solution, but it will get you most of what you need. You can replicate the behavior of "save as full web page (complete)" by parsing the html and downloading any loaded files (images, css, js, etc.) to their same relative path.

Most of the JavaScript won't work due to cross-origin request blocking, but the content will look (mostly) the same.

This uses requests to save the loaded files, lxml to parse the html, and os for the path legwork.

    from selenium import webdriver
    import chromedriver_binary  # adds the chromedriver binary to PATH
    from lxml import html
    import requests
    import os

    driver = webdriver.Chrome()
    URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
    SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA'

    base = 'https://blast.ncbi.nlm.nih.gov/'

    driver.get(URL)
    seq_query_field = driver.find_element_by_id("seq")
    seq_query_field.send_keys(SEQUENCE)
    blast_button = driver.find_element_by_id("b1")
    blast_button.click()

    content = driver.page_source
    # write the page content
    os.mkdir('page')
    with open('page/page.html', 'w') as fp:
        fp.write(content)

    # download the referenced files to the same path as in the html
    sess = requests.Session()
    sess.get(base)            # sets cookies

    # parse html
    h = html.fromstring(content)
    # get css/js files loaded in the head.  skip anything loaded from outside sources
    for hr in h.xpath('head//@href'):
        if hr.startswith('http'):
            continue
        local_path = 'page/' + hr
        res = sess.get(base + hr)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)

    # get image/js files from the body.  skip anything loaded from outside sources
    for src in h.xpath('//@src'):
        if not src or src.startswith('http'):
            continue
        local_path = 'page/' + src
        print(local_path)
        res = sess.get(base + src)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, 'wb') as fp:
            fp.write(res.content)

You should have a folder called page with a file called page.html in it with the content you are after.
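The path handling above relies on plain string concatenation, which can trip over root-relative (`/images/x.png`), protocol-relative (`//cdn.../x.js`) or `data:` references. A possible refinement is to resolve each reference against the page URL with `urllib.parse.urljoin`. This is a minimal sketch, not a tested drop-in replacement: `resolve_asset` is a hypothetical helper name, and the rule "same host only" is my assumption about what "outside sources" should mean.

```python
import os
from urllib.parse import urljoin, urlparse


def resolve_asset(ref, page_url, out_dir="page"):
    """Return (absolute_url, local_path) for a same-host asset, else None.

    Hypothetical helper: `ref` is an href/src value taken from the HTML,
    `page_url` is the URL the page was loaded from.
    """
    absolute = urljoin(page_url, ref)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # e.g. data: or javascript: references
    if parsed.netloc != urlparse(page_url).netloc:
        return None  # loaded from an outside source
    rel = parsed.path.lstrip("/")
    if not rel or rel.endswith("/"):
        return None  # nothing that maps to a file name
    return absolute, os.path.join(out_dir, *rel.split("/"))
```

In the loops above you would call `resolve_asset(hr, URL)`, skip the entry when it returns `None`, and otherwise fetch the first element and write to the second. Note that root-relative references in the saved HTML still point at the filesystem root when opened locally, so those assets may not display without rewriting the links.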

answered Sep 16 '22 15:09

James
