How can I get crawler4j to download all links from a page more quickly?

Tags:

java

crawler4j

What I do is:
- crawl the page
- fetch all the links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them

There must be a quicker way, where I can download the links directly when I visit the page. Thanks!

asked Jan 10 '12 by seinecle

1 Answer

crawler4j automatically does this process for you. You first add one or more seed pages; these are the pages that are fetched and processed first. crawler4j then extracts all the links in those pages and passes them to your shouldVisit function. If you really want to crawl all of them, this function should simply return true for every URL. If you only want to crawl pages within a specific domain, you can check the URL and return true or false based on that.

The URLs for which your shouldVisit returns true are then fetched by the crawler threads, and the same process is repeated on them.
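As a minimal sketch of that pattern (assuming crawler4j 4.x, where shouldVisit receives the referring Page; older releases take only a WebURL, and www.example.com below is a placeholder domain):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Return true to crawl every discovered link, or restrict
        // the crawl to one (placeholder) domain as shown here:
        return url.getURL().toLowerCase().startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // crawler4j calls visit() once per fetched page from its worker
        // threads. The download has already happened at this point, so the
        // content can be saved right here -- no second crawl over a link list.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String content = html.getHtml();
            // ... persist `content` (file, database, etc.)
        }
    }
}
```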

The example code that ships with crawler4j is a good sample to start from.
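Wiring it up then just means pointing a CrawlController at your seed page and choosing a thread count; raising the thread count is what makes the fetching run in parallel. A minimal sketch, with the storage folder and seed URL as placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder path

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robotstxt = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robotstxt);

        controller.addSeed("https://www.example.com/"); // placeholder seed page
        // Second argument is the number of crawler threads; the same
        // shouldVisit/visit cycle runs concurrently on all of them.
        controller.start(MyCrawler.class, 8);
    }
}
```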

answered Sep 29 '22 by Yasser