
How to parse HTML that includes JavaScript code

How does one parse HTML documents that make heavy use of JavaScript? I know there are a few Python libraries that can parse static XML/HTML files. I'm looking for a program or library (or even a Firefox plugin) that reads HTML plus JavaScript, executes the JavaScript, and outputs HTML without any JavaScript, so that the result looks identical when displayed in a browser.

As a simple example

<a href="javascript:web_link(34, true);">link</a>

should be replaced by the appropriate value that the JavaScript function returns, e.g.

<a href="http://www.example.com">link</a>
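For a link as simple as this one, a full browser may not be needed. Here is a minimal Python sketch, assuming the logic of `web_link` can be re-implemented by hand. The `LINK_TABLE` mapping and the Python `web_link` below are hypothetical stand-ins, since the real JavaScript function is not shown in the question.

```python
import html
import re

# Hypothetical stand-in for the page's JavaScript web_link() function.
# The real function's logic is unknown; this table is invented for illustration.
LINK_TABLE = {34: "http://www.example.com"}


def web_link(link_id: int, flag: bool) -> str:
    """Return the URL for link_id (the flag is accepted but ignored in this sketch)."""
    return LINK_TABLE[link_id]


# Matches href="javascript:web_link(<number>, true|false);"
JS_HREF = re.compile(r'href="javascript:web_link\((\d+),\s*(true|false)\);?"')


def resolve_links(markup: str) -> str:
    """Replace every javascript:web_link(...) href with the URL it would produce."""
    def substitute(match: re.Match) -> str:
        url = web_link(int(match.group(1)), match.group(2) == "true")
        return 'href="{}"'.format(html.escape(url, quote=True))

    return JS_HREF.sub(substitute, markup)


print(resolve_links('<a href="javascript:web_link(34, true);">link</a>'))
# prints: <a href="http://www.example.com">link</a>
```

This only works when the JavaScript is simple enough to reproduce in Python. Pages that build their content dynamically need a real JavaScript engine.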

A more complex example would be a saved Facebook HTML page, which is littered with JavaScript code.

This is probably related to "How to "execute" HTML+Javascript page with Node.js", but do I really need Node.js and JSDOM? Also slightly related is "Python library for rendering HTML and javascript", but I'm not interested in rendering, just the pure HTML output.

tom asked Aug 15 '11 10:08



2 Answers

You can use Selenium with Python, as detailed here.

Example:

import os
import xmlrpc.client  # Python 3 name of the old xmlrpclib module

# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpc.client.ServerProxy(server_url)

# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)

# Launch Firefox via the batch file (Windows only: "start" is a cmd.exe built-in).
os.system('start run_firefox.bat')

# Each call below is forwarded to the Selenium server; its result is printed.
print(app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/'))
print(app.verifyTitle('Amazon.com: Welcome'))
print(app.verifySelected('url', 'All Products'))
print(app.select('url', 'Books'))
print(app.verifySelected('url', 'Books'))
print(app.verifyValue('field-keywords', ''))
print(app.type('field-keywords', 'Python Cookbook'))
print(app.clickAndWait('Go'))
print(app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook'))
print(app.verifyTextPresent('Python Cookbook', ''))
print(app.verifyTextPresent('Alex Martellibot, David Ascher', ''))
print(app.testComplete())
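
The example above talks to an XML-RPC Selenium driver, which appears to date from an early Selenium release. A rough equivalent with the WebDriver API might look like the sketch below. It assumes the `selenium` package (version 4 or later) and Firefox with geckodriver are installed. `render` and `strip_scripts` are names made up for this illustration, and the regex-based script removal is a simplification, not a robust HTML sanitiser.

```python
import re

# Matches a whole <script ...>...</script> element, across lines, in any letter case.
SCRIPT_TAG = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.DOTALL | re.IGNORECASE)


def strip_scripts(markup: str) -> str:
    """Drop <script> elements from already-rendered HTML (simplified, regex-based)."""
    return SCRIPT_TAG.sub("", markup)


def render(url: str) -> str:
    """Load url in headless Firefox, let its JavaScript run, and return the DOM as HTML."""
    # Imported here so that strip_scripts stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # page_source reflects the DOM at the time of the call; scripts that
        # finish later (timers, pending requests) may not be reflected yet.
        return driver.page_source
    finally:
        driver.quit()
```

Calling `strip_scripts(render(url))` should then give the HTML as the browser sees it after the scripts have run, minus the `<script>` elements themselves. Inline handlers such as `onclick` and `javascript:` links are left untouched by this sketch.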
PabloG answered Oct 05 '22 09:10



From the Mozilla Gecko FAQ:

Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?

A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.

Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.

Jonas G. Drange answered Oct 05 '22 09:10
