
Get the browser-rendered HTML of a page (with JavaScript executed)

Tags: browser, linux

I need a command-line tool (or JavaScript/PHP, but I think the command line is the only way) to render a URL and get the rendered content. The important part is that the JavaScript must be executed, not only the CSS/HTML/images rendered.

For example, a command like "renderengine http://www.google.es outputfile.html", where the content of the page (HTML parsed and JavaScript executed) is saved in outputfile.html.

I need this because I have to capture the result of a fully JavaScript-driven website like Grooveshark. The site loads everything using JavaScript/AJAX, so crawlers find nothing but an empty basic HTML template (the content is loaded afterwards via AJAX/JavaScript).

Is there a browser engine for Linux with JavaScript support (for example V8) that can output the result so it can be saved to a file?

Zenth asked Sep 10 '13 13:09


1 Answer

  • Selenium: very complete solution with bindings in many languages
  • Puppeteer: headless Chrome API, usable in NodeJS or as a command-line tool
  • HTTrack: command-line tool
  • Apache Nutch & webmagic: open source Java web crawlers
  • pholcus: "distributed & high concurrency" web crawler written in Go
  • Xvfb: a display server implementing the X11 display server protocol without showing any screen output. I have used it successfully with Travis CI and Protractor, for example. Alternative: Xdummy
  • PhantomJS (first suggested by nvuono): can also export the rendered page in non-HTML formats (PDF, PNG...). PhantomJS development is suspended until further notice. Closely related: SlimerJS, CasperJS

And there are many Python web scraping libraries:

  • Scrapy
  • pyspider
  • ghost.py
  • splinter
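
As an illustration of the "renderengine URL outputfile.html" idea from the question, here is a minimal Python sketch that shells out to headless Chrome and saves the DOM it prints with its --dump-dom flag. It is only a sketch under assumptions: the binary names in CHROME_CANDIDATES depend on your distribution, and the function names (find_chrome, build_command, render_to_file) are made up for this example and are not part of any of the tools listed above.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: one of these binary names is on PATH; adjust for your distribution.
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser")


def find_chrome():
    """Return the path of the first Chrome/Chromium binary found on PATH, or None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_command(chrome, url):
    """Build the argument list asking headless Chrome to print the rendered DOM."""
    return [chrome, "--headless", "--disable-gpu", "--dump-dom", url]


def render_to_file(url, output_file, timeout=60):
    """Load `url` in headless Chrome (JavaScript executed) and save the HTML."""
    chrome = find_chrome()
    if chrome is None:
        raise RuntimeError("No Chrome/Chromium binary found on PATH")
    result = subprocess.run(
        build_command(chrome, url),
        capture_output=True,   # --dump-dom writes the serialized DOM to stdout
        encoding="utf-8",
        timeout=timeout,
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    Path(output_file).write_text(result.stdout, encoding="utf-8")
    return output_file
```

Calling render_to_file("http://www.google.es", "outputfile.html") would then write the rendered page to outputfile.html. Content that the site loads well after the initial page load may still be missing from the dump; in that case a scripted driver such as Puppeteer or Selenium, which can wait for a specific element to appear, is probably a better fit.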
Lucas Cimon answered Oct 15 '22 19:10