I want to load a webpage and list all loaded resources (javascript/images/css) for that page. I use this code to load the page:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')
The code above works perfectly and I can do some processing to the HTML page. The question is, how do I list all of the resources loaded by that page? I want something like this:
['http://example.com/img/logo.png',
'http://example.com/css/style.css',
'http://example.com/js/jquery.js',
'http://www.google-analytics.com/ga.js']
I am also open to other solutions, like using the PySide QWebView module. I just want to list the resources loaded by the page.
PhantomJS is a headless WebKit browser: it has no user interface and runs invisibly in the background, which is useful when you want a browser to do work without needing to watch it. Used together with Selenium WebDriver, it lets you run basic system tests directly from the command line, and because it eliminates the need for a graphical browser, interactions and tests run much faster than in a real browser.
Here is a pure-Python solution using Selenium and the ChromeDriver.

How it works:

1. A small HTTP proxy logs every requested URL. It runs in a separate process (via multiprocessing, to avoid splitting the script in two, but you could just as well have the proxy part in a separate script).
2. A Chrome webdriver, configured with the proxy from step 1, reads URLs from standard in and loads them serially.

To use this script, you just type URLs on standard in, and it spits out the loaded URLs (with respective referrers) on standard out. The code:
#!/usr/bin/python3
import sys
import time
import socketserver
import http.server
import urllib.request
from multiprocessing import Process
from selenium import webdriver

PROXY_PORT = 8889
PROXY_URL = 'localhost:%d' % PROXY_PORT

class Proxy(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # For proxied requests, self.path is the absolute URL being fetched.
        sys.stdout.write('%s → %s\n' % (self.headers.get('Referer', 'NO_REFERER'), self.path))
        self.send_response(200)  # minimal status line so strict clients accept the body
        self.end_headers()
        self.copyfile(urllib.request.urlopen(self.path), self.wfile)
        sys.stdout.flush()

    @classmethod
    def target(cls):
        httpd = socketserver.ThreadingTCPServer(('', PROXY_PORT), cls)
        httpd.serve_forever()

# Run the logging proxy in its own process.
p_proxy = Process(target=Proxy.target)
p_proxy.start()

# Route all of Chrome's traffic through the proxy.
webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy": PROXY_URL,
    "ftpProxy": None,
    "sslProxy": None,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
for url in sys.stdin:
    driver.get(url.strip())  # strip the trailing newline from stdin
driver.close()
del driver

p_proxy.terminate()
p_proxy.join()
# avoid warnings about selenium.Service not shutting down in time
time.sleep(3)
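The proxy-logging idea is independent of the browser: anything that speaks HTTP through the proxy shows up in the log. Here is a minimal self-contained sketch of that mechanism, using only the standard library. The port numbers, the backend server, and the seen list are arbitrary choices for the demo, not part of the answer above; a plain urllib request stands in for Chrome.

```python
#!/usr/bin/python3
# Standalone demo of the proxy-logging mechanism: a stdlib HTTP client
# routed through a logging proxy shows up in the request log, no browser needed.
import http.server
import socketserver
import threading
import urllib.request

BACKEND_PORT = 8890  # arbitrary free ports for the demo
PROXY_PORT = 8891
seen = []            # URLs observed by the proxy

socketserver.ThreadingTCPServer.allow_reuse_address = True

class Backend(http.server.BaseHTTPRequestHandler):
    # A trivial origin server standing in for the real website.
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'hello')
    def log_message(self, *args):  # silence per-request logging
        pass

class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # For a proxied request, self.path is the absolute URL.
        seen.append(self.path)
        upstream = urllib.request.urlopen(self.path)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(upstream.read())
    def log_message(self, *args):
        pass

def serve(port, handler):
    # Start a server on a daemon thread and return it for later shutdown.
    httpd = socketserver.ThreadingTCPServer(('', port), handler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd

backend = serve(BACKEND_PORT, Backend)
proxy = serve(PROXY_PORT, LoggingProxy)

# Any client configured with the proxy is logged, exactly like Chrome above.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({'http': 'localhost:%d' % PROXY_PORT}))
body = opener.open('http://localhost:%d/' % BACKEND_PORT).read()
print(seen)  # → ['http://localhost:8890/']

backend.shutdown()
proxy.shutdown()
```

With the real script, you would instead pipe URLs to it from a shell, for example: echo http://example.com | ./list_resources.py (the filename is whatever you saved the script as).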