
Extract links from a web page

Using Java, how can I extract all the links from a given web page?

Asked Feb 25 '11 by Wassim AZIRAR


1 Answer

Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the familiar HTML DOM traversal methods such as getElementsByTag("a"), or, in Jsoup, it's even cooler: you can simply use

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]");        // a with href
Elements pngs = doc.select("img[src$=.png]");  // img with src ending .png

Element masthead = doc.select("div.masthead").first();

and find all links, then get the details using

String linkhref = links.attr("href");

Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax

The selectors use the same syntax as jQuery; if you know jQuery function chaining, you will certainly love this.

EDIT: In case you want more tutorials, you can try out this one made by mkyong.

http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
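If adding a dependency is not an option, here is a rough, stdlib-only sketch of the same idea using a regex. The class and method names are hypothetical, and a regex is not a real HTML parser: it will miss or mangle links in malformed markup, which is exactly why Jsoup is recommended above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical fallback: pulls href values out of anchor tags with a regex.
// Works for simple, well-formed markup only -- prefer Jsoup for anything real.
public class LinkExtractor {
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // captured href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/\">home</a>"
                    + "<a class='nav' href='/about'>about</a></p>";
        System.out.println(extractLinks(html)); // [http://example.com/, /about]
    }
}
```

To fetch the page itself you can read from a java.net.URL stream and feed the resulting string to extractLinks.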

Answered Oct 12 '22 by samarjit samanta