
How to scrape HTTPS javascript web pages

I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?

I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it has been generated.

I only know Python and Java.

Thanks in advance.

jsj asked Apr 06 '11 05:04


2 Answers

Take a look at HtmlUnit - a headless browser written in Java that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html

(obligatory warning: by screen-scraping the site, you may be breaking its ToS and possibly opening yourself up to lawsuits; check whether you are allowed to do it before you start)

Piskvor left the building answered Oct 13 '22 13:10


If they've created a Web API that their JavaScript interfaces with, you might be able to scrape that directly, rather than trying to go the HTML route.
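Here is a minimal sketch of that approach in Python, using only the standard library. It assumes the site's JavaScript loads the catalogue from a JSON endpoint. The URL and the `items`, `name` and `price` field names are hypothetical placeholders; the real ones would have to be found by watching the network requests in the browser's developer tools while the page loads.

```python
import json
import urllib.request

# Hypothetical endpoint: replace it with the request you see in the
# browser's network tab when the catalogue page loads.
API_URL = "https://example.com/api/catalogue?page=1"


def fetch_json(url, timeout=10):
    """Download a URL (HTTPS is handled by urllib) and decode the JSON body."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "User-Agent": "price-monitor/0.1",  # arbitrary label, an assumption
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_prices(payload):
    """Return {item name: price}.

    Assumes a payload shaped like
    {"items": [{"name": ..., "price": ...}, ...]}; adjust the keys to
    whatever the real API returns.
    """
    return {
        item["name"]: float(item["price"])
        for item in payload.get("items", [])
    }
```

Calling `extract_prices(fetch_json(API_URL))` once a day and storing the result would give the price history. If the endpoint requires cookies or a login token, those would need to be added to the request headers as well.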

If they've obfuscated it, or that option isn't available for some other reason, you'll basically need a web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?

bradley.ayers answered Oct 13 '22 14:10