I have been asked to write an app which screen-scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.
Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.
Anyway, while looking into this I discovered the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know of a library or program which I could feed the JavaScript and which would then spit out the DOM for my app to parse?
I can pretty much write the app in any language but my preference would be JavaFX just so I could have a play with it.
Thanks for your time.
Ian
You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not perfect right now, but if your pages don't use many hacks, things should work fine, and the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies persisting after your scraping is over, or all the other nasty things connected to controlling a browser (history, autocomplete, temp files, etc.).
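For example, here's a minimal sketch of loading the page and dumping the DOM after the scripts have run. The URL is a placeholder for your intranet page, and it assumes a reasonably recent HtmlUnit 2.x release where WebClient is AutoCloseable:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class OrderPageDump {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                client.getOptions().setJavaScriptEnabled(true);
                // A messy page will probably have script errors; don't abort on them
                client.getOptions().setThrowExceptionOnScriptError(false);

                HtmlPage page = client.getPage("http://intranet/orders"); // placeholder URL
                client.waitForBackgroundJavaScript(5_000); // give the scripts time to finish

                System.out.println(page.asXml()); // the DOM after the JavaScript has run
            }
        }
    }

From there you can walk the page with getElementById() or XPath queries instead of dumping the whole thing.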
Since you say that no AJAX is used, all the info is present in the HTML source; the JavaScript just renders it based on user clicks. So you need to reverse engineer the way the application works, parse the HTML and the JavaScript code, and extract the useful information. It is strictly a text-parsing job - you shouldn't have to run the JavaScript and produce a new DOM, which would be much more difficult.
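To sketch what the text-parsing route might look like (not the poster's code, just one option): a parser like jsoup can fetch the raw HTML without executing any JavaScript and let you pull out the bits you need. The URL and the selector below are placeholders you'd replace after inspecting the real page source:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class OrderTableParser {
        public static void main(String[] args) throws Exception {
            // Fetch the raw HTML; no JavaScript is executed
            Document doc = Jsoup.connect("http://intranet/orders").get(); // placeholder URL

            // Hypothetical selector - view the page source to find where
            // the order data actually lives before writing this
            for (Element row : doc.select("table#orders tr")) {
                System.out.println(row.text());
            }
        }
    }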
If AJAX were used, your job would be easier: you could figure out how the AJAX services work (they probably return JSON or XML) and extract the information from their responses directly.
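If you did find such endpoints (purely hypothetical here, since the poster says there's no AJAX), hitting them is just an HTTP request. A sketch using the standard java.net.http client, with a made-up endpoint URL:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OrderFeedFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://intranet/api/orders")) // hypothetical endpoint
                .header("Accept", "application/json")
                .build();

            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // raw JSON, ready to parse
        }
    }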