Selenium Python - Get a list of all loaded URLs (images, scripts, stylesheets etc)

When Google Chrome loads a web page through Selenium, it may load additional files required by the page, e.g. images referenced by <img src="example.com/a.png"> tags, scripts referenced by <script src="example.com/a.js"> tags, and CSS stylesheets.

How can I get a list of all URLs that were downloaded when the browser loaded a page, programmatically, using Selenium in Python with chromedriver? In other words, the list of files shown in the "Network" tab of Chrome's developer tools.

Example code using Selenium and chromedriver:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/x-www-browser"
driver = webdriver.Chrome("./chromedriver", chrome_options=options)
# Load some page
driver.get("https://example.com")
# Now, how do I see a list of downloaded URLs that took place when loading the page above?
asked Jun 04 '18 by vatsug

2 Answers

You might want to look at BrowserMob Proxy. It can capture performance data for web apps (via the HAR format), as well as manipulate browser behavior and traffic, such as whitelisting and blacklisting content, simulating network traffic and latency, and rewriting HTTP requests and responses.
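For illustration, here is a minimal sketch of the traffic-manipulation side, assuming the browsermobproxy Python client and a proxy object created as in the usage example below (the regex and limit values are made up):

# Block any URL matching a regex, answering with HTTP 404 instead:
proxy.blacklist("http://www\\.example\\.com/ads/.*", 404)
# Simulate a slow connection (bandwidth in kbps, latency in ms):
proxy.limits({'downstream_kbps': 512, 'upstream_kbps': 256, 'latency': 100})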

Taken from the readthedocs documentation, the usage is simple, and it integrates well with the Selenium WebDriver API. You can read more about BrowserMob Proxy here.

from browsermobproxy import Server

# Start the BrowserMob Proxy server and create a proxy instance
server = Server("path/to/browsermob-proxy")
server.start()
proxy = server.create_proxy()

from selenium import webdriver

# Route Firefox's traffic through the proxy so that requests are captured
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

# Record a new HAR, then load the page
proxy.new_har("google")
driver.get("http://www.google.co.uk")
proxy.har  # returns a HAR JSON blob

server.stop()
driver.quit()
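Note that proxy.har must be read while the proxy server is still running, i.e. before server.stop(). Since it is a plain Python dict in HAR layout, you could, for example, persist it for offline inspection (a small sketch; the filename capture.har is an arbitrary choice):

import json

# Save the captured HAR so it can be opened in any HAR viewer
with open("capture.har", "w") as f:
    json.dump(proxy.har, f, indent=2)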
answered Oct 21 '22 by GPT14


Continuing the suggestion from @GPT14 in his answer, I wrote a small script that accomplishes exactly what I wanted: it prints a list of all URLs that a given page loads.

This uses BrowserMob Proxy. Big thanks to @GPT14 for suggesting it -- it works perfectly for our purposes. I have adapted the code from his answer to the Google Chrome webdriver instead of Firefox, and extended the script so that it traverses the HAR JSON output and lists all request URLs. Remember to adapt the options below to your needs.

from browsermobproxy import Server
from selenium import webdriver

# Purpose of this script: List all resources (URLs) that
# Chrome downloads when visiting some page.

### OPTIONS ###
url = "https://example.com"
chromedriver_location = "./chromedriver"  # path to the chromedriver binary
browsermobproxy_location = "/opt/browsermob-proxy-2.1.4/bin/browsermob-proxy"  # path to the browsermob-proxy binary (which starts a server)
chrome_location = "/usr/bin/x-www-browser"
###############

# Start browsermob proxy
server = Server(browsermobproxy_location)
server.start()
proxy = server.create_proxy()

# Set up the Chrome webdriver (note: this does not seem to work with headless mode enabled)
options = webdriver.ChromeOptions()
options.binary_location = chrome_location
# Point Chrome at our BrowserMob proxy so that it can track requests
options.add_argument('--proxy-server=%s' % proxy.proxy)
driver = webdriver.Chrome(chromedriver_location, chrome_options=options)

# Now load some page
proxy.new_har("Example")
driver.get(url)

# Print all URLs that were requested while loading the page
entries = proxy.har['log']['entries']
for entry in entries:
    if 'request' in entry:
        print(entry['request']['url'])

server.stop()
driver.quit()
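If you also need response details, each HAR entry carries a 'response' object alongside the 'request' per the HAR 1.2 format, so the loop above could be extended along these lines (a sketch; the .get() calls guard against aborted requests whose entries may lack some fields):

# Sketch: print status code and MIME type alongside each URL
for entry in proxy.har['log']['entries']:
    response = entry.get('response', {})
    status = response.get('status')
    mime = response.get('content', {}).get('mimeType', '')
    url = entry.get('request', {}).get('url', '')
    print(status, mime, url)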
answered Oct 21 '22 by vatsug