I can't find a good Java-based web scraping API. The site I need to scrape doesn't provide an API either. I want to iterate over all of its pages using a page ID
and extract the HTML titles and other content from their DOM trees.
Are there ways other than web scraping?
Yes. There are many powerful Java libraries for web scraping. Two examples are Jsoup and HtmlUnit. These libraries help you connect to a web page and offer many methods for extracting the information you need.
It is well suited to building low-latency, scalable, optimized web crawling solutions in Java, and to serving streams of URLs for crawling. It is a highly scalable Java web crawler and can be used for large-scale recursive crawls.
Python is one of the most popular languages for web scraping, because it can handle almost every step of data extraction smoothly.
Extracting the title is not difficult, and you have many options. Search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.
If you know the page structure, you can navigate the page using the DOM; see http://jsoup.org/cookbook/extracting-data/dom-navigation
It's a good library, and I've used it in my recent projects.
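For what it's worth, here is a minimal sketch of how the page-ID loop from the question might look with Jsoup. The URL pattern (`https://example.com/page?id=`), the ID range 1 to 5, the user-agent string, and the class and helper names are all made up for illustration, so adjust them to the real site. It assumes the Jsoup jar is on the classpath.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TitleScraper {
    // Hypothetical URL pattern; replace with the real site's scheme.
    static final String BASE_URL = "https://example.com/page?id=";

    // Builds the URL for one page ID (assumes the ID is simply appended).
    static String pageUrl(String baseUrl, int pageId) {
        return baseUrl + pageId;
    }

    // Parses an HTML string and returns the text of its <title> element
    // (an empty string if the page has no title).
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Example of pulling "other stuff" from the DOM with a CSS selector:
    // the text of the first <h1>, or an empty string if there is none.
    static String firstHeading(String html) {
        Element h1 = Jsoup.parse(html).selectFirst("h1");
        return h1 == null ? "" : h1.text();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed ID range; the real range depends on the site.
        for (int id = 1; id <= 5; id++) {
            String url = pageUrl(BASE_URL, id);
            try {
                // Fetch and parse the page in one step.
                Document doc = Jsoup.connect(url)
                        .userAgent("title-scraper-example")
                        .timeout(10_000)
                        .get();
                System.out.println(id + ": " + doc.title());
            } catch (IOException e) {
                // Covers network errors and non-2xx HTTP responses.
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
            // Pause between requests to avoid hammering the server.
            Thread.sleep(1000);
        }
    }
}
```

The parsing helpers take a plain HTML string, so you can try your selectors on a saved copy of a page before pointing the loop at the live site.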