How to parse HTML that includes JavaScript code

Tags: python, javascript, html

How does one parse HTML documents that make heavy use of JavaScript? I know there are a few Python libraries that can parse static XML/HTML files. I'm looking for a program or library (or even a Firefox plugin) that reads HTML+JavaScript, executes the JavaScript, and outputs HTML without JavaScript that looks identical when displayed in a browser.

As a simple example

<a href="javascript:web_link(34, true);">link</a>

should be replaced by the value the JavaScript function returns, e.g.

<a href="http://www.example.com">link</a>
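
One way to picture the replacement above is as a text rewrite driven by a resolver. The sketch below is only an illustration: `rewrite_js_links` and the `resolve` callback are hypothetical names, and `resolve` stands in for whatever actually evaluates the JavaScript (for example a real browser), which plain Python cannot do on its own.

```python
import re
from html import escape, unescape

# Matches href="javascript:..." or href='javascript:...' inside a tag.
JS_HREF = re.compile(r'''href=(["'])javascript:(.*?)\1''', re.IGNORECASE | re.DOTALL)


def rewrite_js_links(html, resolve):
    """Replace javascript: hrefs with the URL returned by resolve(expression).

    resolve is a hypothetical callback: it receives the JavaScript expression
    as a string and returns a URL, or None to leave the link unchanged.
    """
    def substitute(match):
        expression = unescape(match.group(2)).strip()
        url = resolve(expression)
        if url is None:
            return match.group(0)  # keep the original attribute untouched
        return 'href="{}"'.format(escape(url, quote=True))

    return JS_HREF.sub(substitute, html)
```

With a lookup table such as `{"web_link(34, true);": "http://www.example.com"}` passed as `table.get`, the first anchor above becomes the second. The regular expression is a simplification: it misses unquoted hrefs and links that scripts build at run time.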

A more complex example would be a saved Facebook HTML page, which is littered with JavaScript code.

Probably related to 'How to "execute" HTML+Javascript page with Node.js', but do I really need Node.js and JSDOM? Also slightly related is 'Python library for rendering HTML and javascript', but I'm not interested in rendering, just the pure HTML output.
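
The last step asked for here, emitting HTML with the script elements stripped once something else has executed them, needs no browser. Below is a minimal sketch using only the standard library's `html.parser`. The class and function names are made up for illustration, and it assumes the input is the already-rendered DOM serialized as HTML.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and their contents."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.script_depth = 0

    def _emit(self, text):
        if self.script_depth == 0:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.script_depth = max(0, self.script_depth - 1)
        else:
            self._emit("</{}>".format(tag))

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&{};".format(name))

    def handle_charref(self, name):
        self._emit("&#{};".format(name))

    def handle_comment(self, data):
        self._emit("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self._emit("<!{}>".format(decl))


def strip_scripts(html):
    """Return html with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

This removes script elements only. Inline event-handler attributes such as `onclick` and `javascript:` URLs are left as they are.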


asked Aug 15 '11 10:08

tom

2 Answers

You can use Selenium with Python, which drives a real browser so the page's JavaScript actually runs.

Example:

import os
import xmlrpc.client  # Python 3 name of the old xmlrpclib module

# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpc.client.ServerProxy(server_url)

# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)

# Launch the browser through a batch file ("start" is a Windows command).
os.system('start run_firefox.bat')

# Each call is forwarded to the Selenium server, which drives the browser.
print(app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/'))
print(app.verifyTitle('Amazon.com: Welcome'))
print(app.verifySelected('url', 'All Products'))
print(app.select('url', 'Books'))
print(app.verifySelected('url', 'Books'))
print(app.verifyValue('field-keywords', ''))
print(app.type('field-keywords', 'Python Cookbook'))
print(app.clickAndWait('Go'))
print(app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook'))
print(app.verifyTextPresent('Python Cookbook', ''))
print(app.verifyTextPresent('Alex Martellibot, David Ascher', ''))
print(app.testComplete())
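
The example above talks to a Selenium server over XML-RPC. For comparison, here is a rough sketch of the same idea with the current Selenium WebDriver bindings. It is not part of the original answer and assumes the `selenium` package, Firefox, and geckodriver are installed. `file_url` and `render_html` are illustrative names.

```python
from pathlib import Path


def file_url(path):
    """Turn a local file path into a file:// URL that a browser can open."""
    return Path(path).resolve().as_uri()


def render_html(url):
    """Load url in headless Firefox and return the DOM serialized as HTML."""
    # Imported lazily so file_url works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM after scripts run during page load.
        return driver.page_source
    finally:
        driver.quit()
```

A saved page could then be processed with `render_html(file_url("saved_page.html"))`, where the file name is a placeholder. Scripts that keep changing the page after it loads may need an explicit wait before `page_source` is read.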


answered Oct 05 '22 09:10

PabloG

From the Mozilla Gecko FAQ:

Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?

A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.

Embedding Gecko in a program that outputs what you want may be far too heavyweight, but at least your output will be as good as it gets.

answered Oct 05 '22 09:10

Jonas G. Drange
