Is there a Python module for rendering an HTML page with JavaScript and getting back a DOM object?
I want to parse a page that generates almost all of its content using JavaScript.
The big complication here is emulating a full browser environment outside of a browser. You can use standalone JavaScript interpreters like Rhino and SpiderMonkey to run JavaScript code, but they don't provide the complete browser-like environment (DOM, window, network stack) needed to fully render a web page.
If I needed to solve a problem like this, I would first look at how the JavaScript is rendering the page. It's quite possible it's fetching data via AJAX and using that to render the page. I could then use Python libraries like simplejson and httplib2 to fetch the data directly and use that, negating the need to access the DOM object. However, that's only one possible situation; I don't know the exact problem you are solving.
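As a rough sketch of that approach, assuming the page's script loads its data from a JSON endpoint you found in the browser's network tab: the URL in the usage comment and the "items"/"title" payload shape are made up for illustration, and the standard library's json and urllib stand in here for simplejson and httplib2.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_titles(payload):
    """Pull the titles out of a payload shaped like {"items": [{"title": ...}]}.

    This shape is hypothetical; adapt it to whatever the real endpoint returns.
    """
    return [item["title"] for item in payload.get("items", [])]


# Usage (hypothetical endpoint, not run here):
# payload = fetch_json("https://example.com/api/items")
# print(extract_titles(payload))
```

If the endpoint needs cookies or special headers, copying them from the request the browser makes is usually the quickest way to find out what it expects.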
Other options include the Selenium one mentioned by Łukasz, some kind of embedded WebKit craziness, some kind of IE win32 scripting craziness or, finally, a pyxpcom-based solution (with added craziness). All of these have the drawback of requiring pretty much a fully running web browser for Python to play with, which might not be an option depending on your environment.
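For the Selenium route, something like the following might work. It is only a sketch, assuming Selenium 4 with Firefox and geckodriver installed; render_page, LinkCollector and collect_links are names invented for this example. Selenium hands back the rendered DOM serialized as HTML, which you can then feed to any parser (the standard library's html.parser is used here).

```python
from html.parser import HTMLParser


def render_page(url):
    """Load url in headless Firefox and return the DOM serialized after scripts ran."""
    # Imported lazily so the parsing helpers below work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page fails.
        driver.quit()


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Usage (needs a real browser, not run here):
# print(collect_links(render_page("https://example.com/")))
```

Content that the page loads some time after the initial load may not be in page_source yet, so you may need Selenium's explicit waits before reading it.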
You can probably use python-webkit for it. It requires a running GLib and GTK, but that's probably less problematic than wrapping the parts of WebKit without GLib.
I don't know if it does everything you need, but I guess you should give it a try.