What I do is:
- crawl the page
- fetch all the links on the page and put them in a list
- start a new crawler, which visits each link in the list
- download them
There must be a quicker way, where I can download the links directly while I visit the page? Thanks!
crawler4j automatically does this process for you. You first add one or more seed pages; these are the pages that are fetched and processed first. crawler4j then extracts all the links in these pages and passes them to your shouldVisit function. If you really want to crawl all of them, this function should simply return true for every URL. If you only want to crawl pages within a specific domain, you can check the URL and return true or false based on that.
The URLs for which shouldVisit returns true are then fetched by the crawler threads, and the same process is repeated on them, so the download happens as part of the visit itself rather than in a separate pass.
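For example, a crawler that restricts itself to one domain and handles each downloaded page in visit could look roughly like this. This is a minimal sketch assuming the crawler4j 4.x API (where shouldVisit takes a referring Page argument; older versions take only the WebURL); the www.example.com domain and the filter pattern are placeholders:

```java
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Skip typical static resources (placeholder filter, adjust as needed)
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|ico|pdf|zip))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Only follow links within the (placeholder) target domain
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Called for every fetched page; the content is already downloaded at this point
        String url = page.getWebURL().getURL();
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            String html = htmlData.getHtml();
            System.out.println("Downloaded " + url + " (" + html.length() + " chars)");
            // ... save the HTML to disk, extract data, etc.
        }
    }
}
```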
The example code here is a good starting point.
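A minimal setup that registers a seed page and starts the crawler could look like the sketch below, again assuming the crawler4j 4.x API; the storage folder, seed URL, and thread count are placeholders:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-storage"); // placeholder folder
        config.setMaxDepthOfCrawling(3);                        // optional depth limit

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed page(s): crawling starts here; extracted links are passed to shouldVisit
        controller.addSeed("https://www.example.com/");

        // Start 4 crawler threads; this call blocks until the crawl finishes
        controller.start(MyCrawler.class, 4);
    }
}
```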