autogenerate HTTP screen scraping Java code

I need to screen scrape some data from a website, because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself, using Apache's HttpClient library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP traffic.
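For reference, each call I capture in Charles ends up as a hand-written request roughly like the following (a minimal sketch against the HttpClient 4.x API; the URL is just a placeholder):

    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class ManualScrape {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                // One request replayed from the Charles log; the URL is made up
                HttpGet get = new HttpGet("https://example.com/some/report");
                String html = EntityUtils.toString(client.execute(get).getEntity());
                System.out.println("downloaded " + html.length() + " chars");
            }
        }
    }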

As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code corresponding to a browser session. I expect the generated code wouldn't be as pretty as code written by hand, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure whether it supports this exact use case.

Thanks, Don

asked Jan 08 '09 by Dónal



2 Answers

I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser was scraping and using the page', it's definitely the best option available. HtmlUnit executes (if you want it to) the Javascript in the page.

It currently has full-featured support for all the main Javascript libraries and will execute JS code that uses them. Correspondingly, you can get handles to the Javascript objects in the page programmatically from within your test.
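Driving a page that way looks roughly like this (a rough sketch assuming a recent HtmlUnit 2.x release; the URL and the script are placeholders):

    import com.gargoylesoftware.htmlunit.ScriptResult;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitScrape {
        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                webClient.getOptions().setJavaScriptEnabled(true);
                // getPage() loads the page and runs its on-load Javascript
                HtmlPage page = webClient.getPage("https://example.com/app");
                // You can also execute your own JS in the page and read back the result
                ScriptResult result = page.executeJavaScript("document.title");
                System.out.println(result.getJavaScriptResult());
            }
        }
    }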

If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements, and you don't much care about Javascript, then NekoHTML should suffice. It's similar to JDOM, giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve the pages, as sketched below.
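A sketch of that lighter-weight route, assuming NekoHTML's DOMParser together with HttpClient 4.x (the URL and the tag name are placeholders):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class NekoScrape {
        public static void main(String[] args) throws Exception {
            try (CloseableHttpClient http = HttpClients.createDefault();
                 CloseableHttpResponse response = http.execute(new HttpGet("https://example.com/report"))) {
                // NekoHTML tolerates real-world, malformed HTML and builds a W3C DOM
                DOMParser parser = new DOMParser();
                parser.parse(new InputSource(response.getEntity().getContent()));
                Document doc = parser.getDocument();
                // Walk the tree programmatically, e.g. print every table cell
                // (NekoHTML upper-cases element names by default, hence "TD")
                NodeList cells = doc.getElementsByTagName("TD");
                for (int i = 0; i < cells.getLength(); i++) {
                    System.out.println(cells.item(i).getTextContent());
                }
            }
        }
    }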

answered Sep 29 '22 by j pimmel


The manageability.org blog has an entry listing a whole bunch of web-page scraping tools for Java. I can't seem to reach it at the moment, but I did find a text-only version in Google's cache here.

answered Sep 29 '22 by Nicholas