I have about 5000 links and I need to crawl all of them, so I'm wondering whether there is a better approach than this. Here is my code:
let urls = [ 5000 urls go here ];
const doms = await getDoms(urls);
// processing and storing the doms
const getDoms = async (urls) => {
  // Starts every request at once and waits for all of them to finish
  const data = await Promise.all(urls.map(url => {
    return getSiteCrawlPromise(url);
  }));
  return data;
};
const getSiteCrawlPromise = (url) => {
  return new Promise((resolve, reject) => {
    // Each request gets its own cookie jar
    const j = request.jar();
    request.get({ url: url, jar: j }, function (err, response, body) {
      // Always resolve, so one failed URL does not reject the whole Promise.all
      if (err)
        return resolve({ body: null, jar: j, error: err });
      return resolve({ body: body, jar: j, error: null });
    });
  });
};
Is there a mechanism built into promises that can divide the jobs across multiple threads, process them, and then return the output as a whole? I don't want to divide the URLs into smaller fragments and process those fragments myself.
The Promise object represents the eventual completion (or failure) of an asynchronous operation, and its resulting value.
There is no in-built mechanism in Promises to "divide jobs into multiple threads and process". If you must do that, you'll have to fragment the urls array into smaller arrays and queue the fragmented arrays onto separate crawler instances simultaneously.
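For reference, here is a minimal sketch of what manual chunking could look like. The helper names chunk and processInChunks are hypothetical, not part of any library, and this simplified version processes one batch at a time rather than spreading batches over separate crawler instances.

```javascript
// Hypothetical helper: split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: process the chunks one after another. Within a chunk,
// all calls to `worker` run concurrently. `worker` must return a promise.
async function processInChunks(items, size, worker) {
  const results = [];
  for (const part of chunk(items, size)) {
    const partResults = await Promise.all(part.map(item => worker(item)));
    results.push(...partResults);
  }
  return results;
}
```

With the code from the question, this would be called as processInChunks(urls, 10, getSiteCrawlPromise). The drawback is that each batch waits for its slowest request before the next batch starts.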
But there is absolutely no need to go that way. Since you're using Node.js and node-crawler, you can use the maxConnections option of node-crawler. This is what it was built for, and the end result would be the same: the URLs are crawled with up to maxConnections requests in flight at once (concurrent connections on Node's event loop rather than separate threads), without wasting time and effort on manual chunking, handling multiple crawler instances, or depending on any concurrency libraries.
There isn't such a mechanism built into JavaScript, at least right now.
You can use third-party Promise libraries that offer more features, like Bluebird, in which you can make use of their concurrency feature:
const Promise = require('bluebird');
// Crawl all URLs, with 10 concurrent "threads".
Promise.map(arrayOfUrls, url => {
return /* promise for crawling the url */;
}, { concurrency: 10 });
Another option is to use a dedicated throttling library (I highly recommend bottleneck), which lets you express any generic kind of rate limit. The syntax in that case would be similar to what you already have:
const Bottleneck = require('bottleneck');
const limit = new Bottleneck({ maxConcurrent: 10 });
const getSiteCrawlPromise = limit.wrap(url => {
// the body of your getSiteCrawlPromise function, as normal
});
// getDoms stays exactly the same
You can solve this problem yourself, but bringing in one (or both!) of the libraries above will save you a lot of code.
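If you do want to roll it yourself, a rough sketch might look like the following. The name mapWithConcurrency is made up for this example. It assumes the worker never rejects, which holds for the getSiteCrawlPromise in the question since it always resolves; a rejecting worker would reject the whole call.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
// Results come back in the same order as `items`, like Promise.all.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item to pick up

  // Each lane keeps pulling the next unprocessed item until none remain.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Start up to `limit` lanes and wait for all of them to drain the list.
  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

Inside getDoms, the Promise.all call would then become something like await mapWithConcurrency(urls, 10, getSiteCrawlPromise). Unlike fixed-size batches, a new request starts as soon as any earlier one finishes.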