I have been asked to write an app which screen-scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.
Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.
Anyway, while looking into this I discovered the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know of a library or program which I could feed the JavaScript and which would then spit out the DOM for my app to parse?
I can pretty much write the app in any language but my preference would be JavaFX just so I could have a play with it.
Thanks for your time.
Ian
You may consider using HtmlUnit. It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not perfect right now, but if your pages don't use many hacks, things should work fine, and the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies persisting after your scraping is over, or all the other nasty things connected to controlling a browser (history, autocomplete, temp files, etc.).
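For example, here's a minimal sketch of loading the page and dumping the DOM after the scripts have run. The URL is a placeholder for your intranet page, and it assumes a reasonably recent HtmlUnit 2.x release where WebClient is AutoCloseable:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class OrderPageDump {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                client.getOptions().setJavaScriptEnabled(true);
                // A messy page will probably have script errors; don't abort on them
                client.getOptions().setThrowExceptionOnScriptError(false);

                HtmlPage page = client.getPage("http://intranet/orders"); // placeholder URL
                client.waitForBackgroundJavaScript(5_000); // give the scripts time to finish

                System.out.println(page.asXml()); // the DOM after the JavaScript has run
            }
        }
    }

From there you can walk the page with getElementById() or XPath queries instead of dumping the whole thing.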
Since you say that no AJAX is used, all the info is present in the HTML source; the JavaScript just renders it based on user clicks. So you need to reverse engineer the way the application works, parse the HTML and the JavaScript code, and extract the useful information. It is strictly a text-parsing job - you shouldn't have to run the JavaScript and produce a new DOM, which would be much more difficult.
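To sketch what the text-parsing route might look like (not the poster's code, just one option): a parser like jsoup can fetch the raw HTML without executing any JavaScript and let you pull out the bits you need. The URL and the selector below are placeholders you'd replace after inspecting the real page source:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class OrderTableParser {
        public static void main(String[] args) throws Exception {
            // Fetch the raw HTML; no JavaScript is executed
            Document doc = Jsoup.connect("http://intranet/orders").get(); // placeholder URL

            // Hypothetical selector - view the page source to find where
            // the order data actually lives before writing this
            for (Element row : doc.select("table#orders tr")) {
                System.out.println(row.text());
            }
        }
    }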
If AJAX were used, your job would be easier: you could figure out how the AJAX services work (they probably return JSON or XML) and extract the information from their responses directly.
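If you did find such endpoints (purely hypothetical here, since the poster says there's no AJAX), hitting them is just an HTTP request. A sketch using the standard java.net.http client, with a made-up endpoint URL:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OrderFeedFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://intranet/api/orders")) // hypothetical endpoint
                .header("Accept", "application/json")
                .build();

            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // raw JSON, ready to parse
        }
    }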