Looking for a simple Java spider [closed]

I need to supply a base URL (such as http://www.wired.com) and spider through the entire site, outputting an array of pages (off the base URL). Is there any library that would do the trick?

Thanks.

asked Feb 04 '11 by rs79

2 Answers

I have used Web Harvest a couple of times, and it is quite good for web scraping.

Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery, and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. On the other hand, it can easily be supplemented by custom Java libraries to augment its extraction capabilities.
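Web-Harvest itself is driven by XML configuration files rather than Java code. Below is a minimal sketch of running one from Java, based on the embedding example in the project's documentation; the configuration file name (wired-links.xml) and the working directory are hypothetical, and the class names are those of the 2.x API:

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    // Minimal embedding sketch: loads a Web-Harvest XML configuration and runs it.
    // "wired-links.xml" is a placeholder for a configuration you write yourself.
    public class WebHarvestDemo {
        public static void main(String[] args) throws Exception {
            ScraperConfiguration config = new ScraperConfiguration("wired-links.xml");
            Scraper scraper = new Scraper(config, "work");  // "work" is the working directory
            scraper.setDebug(true);
            scraper.execute();
        }
    }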

Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
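A rough sketch of that approach, assuming JTidy (org.w3c.tidy) is on the classpath: it fetches the page, tidies it into a DOM, and evaluates the XPath expression from above with the standard javax.xml.xpath API (error handling omitted):

    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    // Sketch: tidy the page into a DOM, then pull out href attributes with XPath.
    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            Document doc = tidy.parseDOM(new URL("http://www.wired.com").openStream(), null);

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                    "//a[contains(@href,'wired')]/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                System.out.println(links.item(i).getNodeValue());
            }
        }
    }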

answered by João Silva

'Simple' is perhaps not a relevant concept here; this is a complex task. I recommend Apache Nutch.
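For a sense of why: even the most naive spider, sketched below with nothing but the JDK and a regex, already has to track visited pages, stay on one host, and survive broken links, and it still ignores robots.txt, politeness delays, redirects, relative URLs, and malformed HTML. This is a hypothetical illustration, not a substitute for Nutch:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Naive breadth-first spider: fetches pages, extracts absolute href values with
    // a regex, and only follows links on the starting host.
    public class NaiveSpider {

        private static final Pattern HREF = Pattern.compile("href=[\"'](http[^\"']+)[\"']");

        public static List<String> crawl(String baseUrl, int maxPages) throws IOException {
            String host = new URL(baseUrl).getHost();
            Set<String> visited = new LinkedHashSet<String>();
            Queue<String> queue = new LinkedList<String>();
            queue.add(baseUrl);
            while (!queue.isEmpty() && visited.size() < maxPages) {
                String url = queue.poll();
                if (!visited.add(url)) {
                    continue;                   // already crawled
                }
                try {
                    Matcher m = HREF.matcher(fetch(url));
                    while (m.find()) {
                        String link = m.group(1);
                        if (new URL(link).getHost().equals(host) && !visited.contains(link)) {
                            queue.add(link);    // only follow links on the same host
                        }
                    }
                } catch (IOException e) {
                    // skip pages that fail to download
                }
            }
            return new ArrayList<String>(visited);
        }

        private static String fetch(String url) throws IOException {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            try {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                return sb.toString();
            } finally {
                in.close();
            }
        }

        public static void main(String[] args) throws IOException {
            for (String page : crawl("http://www.wired.com", 50)) {
                System.out.println(page);
            }
        }
    }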

answered by bmargulies