Web scraping with Java

Tags: java, frameworks, web-scraping

I can't find a good Java-based web scraping API. The site I need to scrape does not provide an API either. I want to iterate over all of its pages using some pageID and extract the HTML titles and other content from their DOM trees.

Are there ways other than web scraping?


asked Jul 08 '10 09:07


1 Answer

jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is jsoup.

If you know the page structure, you can navigate the page through its DOM; see http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library, and I've used it in my recent projects.
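To make that concrete, here is a minimal sketch of iterating over page IDs and pulling the title with jsoup. It assumes jsoup is on the classpath, and the URL pattern (`https://example.com/page?pageID=`), the class name `TitleScraper`, and the ID range 1 to 3 are all placeholders I made up for illustration; swap in whatever the real site uses.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class TitleScraper {

    // Hypothetical URL pattern; replace with the real site's scheme.
    static final String BASE_URL = "https://example.com/page?pageID=";

    /** Builds the URL for one page ID (the query parameter name is an assumption). */
    static String pageUrl(int pageId) {
        return BASE_URL + pageId;
    }

    /** Parses an HTML string and returns the text of its <title> element. */
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    /** Returns the text of the first element matching a CSS selector, or "" if none matches. */
    static String firstText(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.text();
    }

    /** Fetches one page over HTTP and returns its title. */
    static String fetchTitle(int pageId) throws IOException {
        Document doc = Jsoup.connect(pageUrl(pageId))
                .timeout(10_000) // milliseconds
                .get();
        return doc.title();
    }

    public static void main(String[] args) {
        // The ID range is illustrative only.
        for (int pageId = 1; pageId <= 3; pageId++) {
            try {
                System.out.println(pageId + ": " + fetchTitle(pageId));
            } catch (IOException e) {
                // Covers network failures and non-2xx responses; skip and continue.
                System.err.println(pageId + ": failed (" + e.getMessage() + ")");
            }
        }
    }
}
```

For the "other stuff" in the DOM, something like `firstText(html, "h1")` or any other CSS selector should work once you know the page structure. If you scrape many pages, it is probably worth adding a delay between requests and checking the site's terms first.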

answered Sep 28 '22 01:09

Wajdy Essam
