How to list loaded resources with Selenium/PhantomJS?

I want to load a web page and list all of the resources (JavaScript/images/CSS) it loads. I use this code to load the page:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')

The code above works perfectly and I can do some processing on the HTML page. The question is: how do I list all of the resources loaded by that page? I want something like this:

['http://example.com/img/logo.png',
 'http://example.com/css/style.css',
 'http://example.com/js/jquery.js',
 'http://www.google-analytics.com/ga.js']

I'm also open to other solutions, such as using the PySide QWebView module. I just want to list the resources loaded by the page.
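With PySide, for instance, I imagine something along these lines: an untested sketch (class and variable names are mine) that assumes the Qt 4 QtWebKit bindings and logs every request by routing page traffic through a custom QNetworkAccessManager:

from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView
from PySide.QtNetwork import QNetworkAccessManager

class LoggingNetworkAccessManager(QNetworkAccessManager):
    """Records the URL of every request the page makes."""
    def __init__(self):
        super(LoggingNetworkAccessManager, self).__init__()
        self.requested_urls = []

    def createRequest(self, operation, request, device=None):
        self.requested_urls.append(request.url().toString())
        return QNetworkAccessManager.createRequest(self, operation, request, device)

app = QApplication([])
view = QWebView()
manager = LoggingNetworkAccessManager()
# Must be set before loading anything, so all page traffic goes through it
view.page().setNetworkAccessManager(manager)
view.loadFinished.connect(app.quit)
view.load(QUrl('http://example.com'))
app.exec_()
print(manager.requested_urls)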

Asked Nov 05 '13 by flowfree



1 Answer

Here is a pure-Python solution using Selenium and ChromeDriver.

How it works:

  1. First we create a minimalistic HTTP proxy listening on localhost. This proxy is the one responsible for printing whatever requests are generated by Selenium.
    (NOTE: we are using multiprocessing to avoid splitting the script in two, but you could just as well put the proxy part in a separate script.)
  2. Then we create the webdriver, configured with the proxy from step 1, read URLs from standard input, and load them serially.
    Loading the URLs in parallel is left as an exercise for the reader ;)

To use this script, you just type URLs on standard input, and it prints the loaded URLs (with their respective referrers) on standard output. The code:

#!/usr/bin/python3

import sys
import time
import socketserver
import http.server
import urllib.request
from multiprocessing import Process

from selenium import webdriver

PROXY_PORT = 8889
PROXY_URL = 'localhost:%d' % PROXY_PORT

class Proxy(http.server.SimpleHTTPRequestHandler):
    """Minimal forwarding proxy that logs every GET request it sees."""
    def do_GET(self):
        # Print "referrer → requested URL" for each resource the browser fetches
        sys.stdout.write('%s → %s\n' % (self.headers.get('Referer', 'NO_REFERER'), self.path))
        # Forward the request upstream and stream the response back to the browser
        self.copyfile(urllib.request.urlopen(self.path), self.wfile)
        sys.stdout.flush()

    @classmethod
    def target(cls):
        httpd = socketserver.ThreadingTCPServer(('', PROXY_PORT), cls)
        httpd.serve_forever()

# Run the proxy in a separate process so it doesn't block the webdriver
p_proxy = Process(target=Proxy.target)
p_proxy.start()

# Tell Chrome to send all plain-HTTP traffic through our proxy
webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy": PROXY_URL,
    "ftpProxy": None,
    "sslProxy": None,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
for url in sys.stdin:
    driver.get(url.strip())  # strip the trailing newline before loading
driver.close()
del driver
p_proxy.terminate()
p_proxy.join()
# avoid warnings about selenium.Service not shutting down in time
time.sleep(3)
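For example, assuming the script above is saved as list_resources.py (the file name is arbitrary):

echo 'http://example.com' | python3 list_resources.py

Each output line pairs a referrer with the resource it triggered, e.g. http://example.com/ → http://example.com/css/style.css. Note that since sslProxy is None in the capabilities above, only plain-HTTP traffic passes through the proxy, so HTTPS resources won't show up unless you also configure SSL proxying.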
Answered Sep 24 '22 by Leo Antunes