How to list loaded resources with Selenium/PhantomJS?

I want to load a web page and list all of the resources (JavaScript/images/CSS) it loads. I use this code to load the page:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')

The code above works perfectly and I can do some processing on the HTML page. The question is: how do I list all of the resources loaded by that page? I want something like this:

['http://example.com/img/logo.png',
 'http://example.com/css/style.css',
 'http://example.com/js/jquery.js',
 'http://www.google-analytics.com/ga.js']

I'm also open to other solutions, such as using the PySide QWebView module. I just want to list the resources loaded by the page.
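With PySide, for instance, I imagine something along these lines: an untested sketch (class and variable names are mine) that assumes the Qt 4 QtWebKit bindings and logs every request by routing page traffic through a custom QNetworkAccessManager:

from PySide.QtCore import QUrl
from PySide.QtGui import QApplication
from PySide.QtWebKit import QWebView
from PySide.QtNetwork import QNetworkAccessManager

class LoggingNetworkAccessManager(QNetworkAccessManager):
    """Records the URL of every request the page makes."""
    def __init__(self):
        super(LoggingNetworkAccessManager, self).__init__()
        self.requested_urls = []

    def createRequest(self, operation, request, device=None):
        self.requested_urls.append(request.url().toString())
        return QNetworkAccessManager.createRequest(self, operation, request, device)

app = QApplication([])
view = QWebView()
manager = LoggingNetworkAccessManager()
# Must be set before loading anything, so all page traffic goes through it
view.page().setNetworkAccessManager(manager)
view.loadFinished.connect(app.quit)
view.load(QUrl('http://example.com'))
app.exec_()
print(manager.requested_urls)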

Asked Nov 05 '13 by flowfree



1 Answer

Here is a pure-Python solution using Selenium and ChromeDriver.

How it works:

  1. First we create a minimalistic HTTP proxy listening on localhost. This proxy is the one responsible for printing whatever requests are generated by Selenium.
    (NOTE: we are using multiprocessing to avoid splitting the script in two, but you could just as well put the proxy part in a separate script.)
  2. Then we create the webdriver, configured with the proxy from step 1, read URLs from standard input, and load them serially.
    Loading the URLs in parallel is left as an exercise for the reader ;)

To use this script, you just type URLs on standard input, and it prints the loaded URLs (with their respective referrers) on standard output. The code:

#!/usr/bin/python3

import sys
import time
import socketserver
import http.server
import urllib.request
from multiprocessing import Process

from selenium import webdriver

PROXY_PORT = 8889
PROXY_URL = 'localhost:%d' % PROXY_PORT

class Proxy(http.server.SimpleHTTPRequestHandler):
    """Minimal forwarding proxy that logs every GET request it sees."""
    def do_GET(self):
        # Print "referrer → requested URL" for each resource the browser fetches
        sys.stdout.write('%s → %s\n' % (self.headers.get('Referer', 'NO_REFERER'), self.path))
        # Forward the request upstream and stream the response back to the browser
        self.copyfile(urllib.request.urlopen(self.path), self.wfile)
        sys.stdout.flush()

    @classmethod
    def target(cls):
        httpd = socketserver.ThreadingTCPServer(('', PROXY_PORT), cls)
        httpd.serve_forever()

# Run the proxy in a separate process so it doesn't block the webdriver
p_proxy = Process(target=Proxy.target)
p_proxy.start()

# Tell Chrome to send all plain-HTTP traffic through our proxy
webdriver.DesiredCapabilities.CHROME['proxy'] = {
    "httpProxy": PROXY_URL,
    "ftpProxy": None,
    "sslProxy": None,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}

driver = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver')
for url in sys.stdin:
    driver.get(url.strip())  # strip the trailing newline before loading
driver.close()
del driver
p_proxy.terminate()
p_proxy.join()
# avoid warnings about selenium.Service not shutting down in time
time.sleep(3)
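For example, assuming the script above is saved as list_resources.py (the file name is arbitrary):

echo 'http://example.com' | python3 list_resources.py

Each output line pairs a referrer with the resource it triggered, e.g. http://example.com/ → http://example.com/css/style.css. Note that since sslProxy is None in the capabilities above, only plain-HTTP traffic passes through the proxy, so HTTPS resources won't show up unless you also configure SSL proxying.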
Answered Sep 24 '22 by Leo Antunes