
Get all Images from WebPage Program | Java

I need a program that, given a URL, returns a list of all the images on the web page.

For example:

logo.png gallery1.jpg test.gif

Is there any open-source software available that does this, before I try to code something myself?

The language should be Java. Thanks, Philip

Phil asked Jan 31 '10 18:01


3 Answers

Just use a simple HTML parser, like jTidy, and then get all elements by tag name img and then collect the src attribute of each in a List<String> or maybe List<URI>.

You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:

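// Needs: java.io.InputStream, java.net.URL, java.util.*, org.w3c.dom.*, org.w3c.tidy.Tidy
List<String> srcs = new ArrayList<String>();

// try-with-resources closes the stream even if parsing fails
try (InputStream input = new URL("http://www.stackoverflow.com").openStream()) {
    Document document = new Tidy().parseDOM(input, null);
    NodeList imgs = document.getElementsByTagName("img");

    for (int i = 0; i < imgs.getLength(); i++) {
        // An <img> without a src attribute would otherwise cause a NullPointerException
        Node src = imgs.item(i).getAttributes().getNamedItem("src");
        if (src != null) {
            srcs.add(src.getNodeValue());
        }
    }
}

for (String src : srcs) {
    System.out.println(src);
}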
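The src values collected this way are often relative (for example logo.png or /img/gallery1.jpg), so turning them into a List<URI> usually means resolving each one against the page URL. Here is a minimal sketch of that step using only java.net.URI. The class name ImageUrlResolver, the method resolveAll and the example.com URLs are made up for illustration, and the sketch assumes the page does not override its base URL with a <base> tag.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jTidy or any other library.
public class ImageUrlResolver {

    // Resolves each raw src value against the page URL.
    // Blank values and values that are not syntactically valid URIs are skipped.
    public static List<URI> resolveAll(String pageUrl, List<String> srcs) {
        URI base = URI.create(pageUrl);
        List<URI> resolved = new ArrayList<URI>();
        for (String src : srcs) {
            if (src == null || src.trim().isEmpty()) {
                continue;
            }
            try {
                // Relative references are resolved against base; absolute ones are kept as-is
                resolved.add(base.resolve(src.trim()));
            } catch (IllegalArgumentException e) {
                // Not a valid URI (e.g. contains an unescaped space); skip it
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Example values only; in practice these come from the parser loop
        List<String> srcs = Arrays.asList(
                "logo.png", "/img/gallery1.jpg", "http://cdn.example.com/test.gif");
        for (URI uri : resolveAll("http://www.example.com/photos/index.html", srcs)) {
            System.out.println(uri);
        }
    }
}
```

Skipping invalid values silently keeps the sketch short; logging or percent-encoding them may be a better fit if every image has to be accounted for.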

I must however admit that HtmlUnit as suggested by Bozho indeed looks better.

BalusC answered Sep 20 '22 00:09


HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.

(Read the short "Get started" guide to see how to obtain the correct HtmlPage object.)

Bozho answered Sep 20 '22 00:09


This is dead simple with HTML Parser (and any other decent HTML parser):

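// Needs: org.htmlparser.Parser, org.htmlparser.Tag, org.htmlparser.filters.TagNameFilter,
//        org.htmlparser.util.NodeList, org.htmlparser.util.SimpleNodeIterator
Parser parser = new Parser("http://www.yahoo.com/");
// Collect every IMG tag in the page
NodeList list = parser.parse(new TagNameFilter("IMG"));

for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}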

Pascal Thivent answered Sep 23 '22 00:09