Web scraping with Java

I can't find a good Java-based web scraping API. The site I need to scrape doesn't provide an API either; I want to iterate over all of its pages using some pageID and extract the HTML titles and other data from their DOM trees.

Are there ways other than web scraping?

NoneType asked Jul 08 '10 09:07


1 Answer

jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is jsoup.

If you know the page structure, you can navigate the page through its DOM; see http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library, and I've used it in my recent projects.
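
To make the title extraction and DOM navigation concrete, here is a minimal sketch, assuming jsoup is on the classpath. The URL template, the pageID range, the `content` element id, and the class and method names are all placeholders I made up for illustration; adjust them to the real site.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleScraper {

    // Hypothetical URL pattern; replace it with the real site's page URL.
    static final String URL_TEMPLATE = "https://example.com/page?pageID=%d";

    /** Returns the text of the <title> element of a parsed page. */
    static String extractTitle(Document doc) {
        return doc.title();
    }

    /**
     * DOM navigation: returns the href of every <a> element found inside
     * the element with the given id, or an empty list if no such element exists.
     */
    static List<String> linksInside(Document doc, String elementId) {
        List<String> hrefs = new ArrayList<>();
        Element container = doc.getElementById(elementId);
        if (container == null) {
            return hrefs;
        }
        for (Element link : container.getElementsByTag("a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed pageID range, for illustration only.
        for (int pageId = 1; pageId <= 3; pageId++) {
            String url = String.format(URL_TEMPLATE, pageId);
            // Fetches and parses the page; throws IOException on network or HTTP errors.
            Document doc = Jsoup.connect(url).timeout(10_000).get();
            System.out.println(pageId + ": " + extractTitle(doc));
            System.out.println(linksInside(doc, "content"));
            Thread.sleep(1000); // pause between requests to avoid hammering the site
        }
    }
}
```

The two helper methods work on an already parsed `Document`, so you can also try them offline on a string with `Jsoup.parse(html)` before pointing the loop at the real site.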

Wajdy Essam answered Sep 28 '22 01:09