
How to process a large number of requests with Promise.all

I have about 5000 links and I need to crawl them all. I'm wondering whether there is a better approach than this. Here is my code.

const request = require('request');

let urls = [ /* 5000 urls go here */ ];

const doms = await getDoms(urls);

// processing and storing the doms

const getDoms = async (urls) => {

  // fire off every crawl at once and wait for all of them to finish
  const data = await Promise.all(urls.map(url => {
    return getSiteCrawlPromise(url);
  }));
  return data;

};

const getSiteCrawlPromise = (url) => {

  return new Promise((resolve) => {
    // a fresh cookie jar per request, returned along with the body
    let j = request.jar();
    request.get({ url: url, jar: j }, function(err, response, body) {
        if (err)
          return resolve({ body: null, jar: j, error: err });
        return resolve({ body: body, jar: j, error: null });
    });
  });

};

Is there a mechanism implemented in Promise that can divide the jobs across multiple threads, process them, and then return the output as a whole? I don't want to divide the urls into smaller fragments and process those fragments separately.

asked Sep 17 '25 by NuOne

2 Answers

The Promise object represents the eventual completion (or failure) of an asynchronous operation, and its resulting value.

There is no in-built mechanism in Promises to "divide jobs into multiple threads and process". If you must do that, you'll have to fragment the urls array into smaller arrays and queue the fragmented arrays onto separate crawler instances simultaneously.

But there is absolutely no need to go that way. Since you're using Node.js and node-crawler, you can use node-crawler's maxConnections option. That is what it was built for, and the end result would be the same: the urls are crawled with multiple connections in flight at once, without wasting time and effort on manual chunking and handling of multiple crawler instances, or depending on any concurrency libraries. A sketch of that option in use follows below.
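Here is a minimal sketch of that approach, assuming the npm crawler package (node-crawler); the results array and the drain handler are illustrative additions, not part of the asker's code:

const Crawler = require('crawler');

const results = [];

const c = new Crawler({
    maxConnections: 10,                 // at most 10 urls are fetched at any one time
    callback: (error, res, done) => {
        // collect each page (or its error) as it finishes
        results.push({ url: res.options.uri, body: error ? null : res.body, error: error || null });
        done();                         // free this connection slot for the next url
    }
});

// fires once the queue is empty and every callback has completed
c.on('drain', () => {
    console.log(`Crawled ${results.length} pages`);
});

c.queue(urls);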

answered Sep 19 '25 by PrashanD

There isn't such a mechanism built into JavaScript, at least right now.

You can use a third-party Promise library that offers more features, like Bluebird, in which you can make use of its concurrency feature:

const Promise = require('bluebird');

// Crawl all URLs, with 10 concurrent "threads".
Promise.map(arrayOfUrls, url => {
    return /* promise for crawling the url */;
}, { concurrency: 10 });
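Applied to the code in the question, that would mean replacing Promise.all with Promise.map (a sketch, reusing the getSiteCrawlPromise function from the question):

const Promise = require('bluebird');

// at most 10 crawls run at the same time; the resolved array keeps the same
// order as the input urls, just like Promise.all
const getDoms = (urls) => {
    return Promise.map(urls, url => getSiteCrawlPromise(url), { concurrency: 10 });
};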

Another option is to use a dedicated throttling library (I highly recommend bottleneck), which lets you express any generic kind of rate limit. The syntax in that case would be similar to what you already have:

const Bottleneck = require('bottleneck');
const limit = new Bottleneck({ maxConcurrent: 10 });

const getSiteCrawlPromise = limit.wrap(url => {
    // the body of your getSiteCrawlPromise function, as normal
});

// getDoms stays exactly the same
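Filled in with the request-based crawl from the question, the wrapped function could look like this (a sketch; the error-swallowing resolve mirrors the original code):

const request = require('request');
const Bottleneck = require('bottleneck');

const limit = new Bottleneck({ maxConcurrent: 10 });

// bottleneck queues the calls internally, so at most 10 requests are in flight at once
const getSiteCrawlPromise = limit.wrap(url => {
    return new Promise((resolve) => {
        const j = request.jar();
        request.get({ url: url, jar: j }, (err, response, body) => {
            if (err) return resolve({ body: null, jar: j, error: err });
            resolve({ body: body, jar: j, error: null });
        });
    });
});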

You can solve this problem yourself, but bringing in one (or both!) of the libraries above will save you a lot of code.

answered Sep 19 '25 by Elliot Nelson