I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?
I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it is generated.
I only know Python and Java.
Thanks in advance.
Take a look at HtmlUnit, a headless Java browser that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html
(Obligatory warning: by screen-scraping the site, you may be breaking its ToS and possibly opening yourself to lawsuits; check whether you are allowed to do it before you start.)
If they've created a Web API that their JavaScript interfaces with, you might be able to scrape that directly, rather than trying to go the HTML route.
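As a rough illustration of the direct-API route, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption: the endpoint URL is a placeholder (the real one, if it exists, would show up in the browser's developer tools under the Network tab while the catalogue loads), and the JSON layout with an "items" list of "name"/"price" entries is invented for the example, so adjust both to whatever the site actually returns.

```python
import json
import urllib.request

# Hypothetical endpoint, not taken from any real site. Replace it with the
# request URL you observe in the browser's Network tab.
CATALOGUE_URL = "https://example.com/api/catalogue?page=1"


def fetch_json(url, timeout=10):
    """Download a URL over HTTPS and decode the response body as JSON."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "User-Agent": "price-monitor/0.1",  # arbitrary identifier
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_prices(payload):
    """Map item name to price.

    Assumes a payload shaped like
    {"items": [{"name": "...", "price": ...}, ...]},
    which is an invented layout for this sketch.
    """
    return {
        item["name"]: float(item["price"])
        for item in payload.get("items", [])
    }


def main():
    """Fetch the catalogue once and print each item's price."""
    prices = extract_prices(fetch_json(CATALOGUE_URL))
    for name, price in sorted(prices.items()):
        print(f"{name}: {price:.2f}")
```

Calling `main()` once a day (from cron or a scheduled task, say) and appending the results to a file would give you the day-to-day history. If the API needs cookies or a token that the page sets first, this simple version will not be enough and the browser-based route is probably the easier one.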
If they've obfuscated it or that option isn't available for some other reason, you'll basically need a Web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?