Currently I need a program that, given a URL, returns a list of all the images on the webpage, i.e.:
logo.png gallery1.jpg test.gif
Is there any open source software available before I try to code something? The language should be Java. Thanks, Philip
Just use a simple HTML parser like jTidy, get all elements by tag name img, and then collect the src attribute of each in a List<String> (or maybe a List<URI>).
You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you'd like to use. Here's a kickoff example:
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

// Parse the page with jTidy and collect the src attribute of every <img> element.
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
    srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}
for (String src : srcs) {
    System.out.println(src);
}
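By the way, the src values are often relative, so if you'd rather collect a List<URI> as mentioned above, here's a minimal sketch that resolves each value against the page URL (it reuses the srcs list from the snippet above, needs an extra import java.net.URI, and omits exception handling like the snippet above):

URI base = new URI("http://www.stackoverflow.com");
List<URI> uris = new ArrayList<URI>();
for (String src : srcs) {
    uris.add(base.resolve(src)); // turns a relative path like "/gallery1.jpg" into an absolute URI
}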
I must admit, however, that HtmlUnit, as suggested by Bozho, indeed looks better.
HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you. (Read the short Getting started guide to see how to obtain the correct HtmlPage object.)
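For completeness, here's a minimal sketch of that approach. It assumes a recent HtmlUnit 2.x release, where WebClient is AutoCloseable and is configured through getOptions(); older versions call webClient.setJavaScriptEnabled(false) directly instead:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(false); // static scraping doesn't need JS
    HtmlPage page = webClient.getPage("http://www.stackoverflow.com");
    for (DomElement img : page.getElementsByTagName("img")) {
        System.out.println(img.getAttribute("src"));
    }
}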
This is dead simple with HTML Parser (and any other decent HTML parser):
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

// Grab every IMG tag and print its src attribute.
Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));
for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}