How can I make puppeteer follow multiple links in new page instances, to evaluate them in a concurrent and asynchronous way?
Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js handles the job well too. Puppeteer lets us drive a headless browser programmatically: although it was designed for testing, it allows you to do almost anything you can do manually in a browser, such as visiting pages, clicking links, submitting forms, and taking screenshots.
The trivial solution is to run a loop that takes the next item, drives the browser to interact with the site, and writes the result to some collection. For example, the code starts the browser and then runs the scraper sequentially, each job in a new tab, as in the sketch below.
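A minimal sketch of that sequential approach, assuming a urls array and a placeholder page.evaluate call for the per-page scraping logic (neither is from the original snippet):

const puppeteer = require('puppeteer');

(async () => {
  const urls = ['http://www.google.com/', 'http://www.wikipedia.org/'];
  const results = [];

  const browser = await puppeteer.launch();
  for (const url of urls) {
    const page = await browser.newPage(); // each job gets its own tab
    await page.goto(url);
    // Extract whatever you need from the page; the title is just an example.
    results.push(await page.evaluate(() => document.title));
    await page.close();
  }
  await browser.close();

  console.log(results);
})();

Each iteration waits for the previous one to finish, which is exactly the limitation the answers below address.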
Marek's solution is fine for a few pages, but if you want to crawl a large number of pages concurrently, I recommend looking at my library, puppeteer-cluster.
It runs tasks in parallel (like Marek's solution), but also takes care of error handling, retrying, and a few other things. You can see a minimal example below; the library can also be used in more complex settings.
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // use an incognito browser context per worker
    maxConcurrency: 4, // cluster with four workers
  });

  // Define a task to be executed for your data
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // ...
  });

  // Queue URLs
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // ...

  // Wait for cluster to idle and close it
  await cluster.idle();
  await cluster.close();
})();
Almost every Puppeteer method returns a Promise, so you can use, for example, the es6-promise-pool package (https://www.npmjs.com/package/es6-promise-pool).
First, you need to create an async function that processes one URL:
// "browser" is a previously launched Puppeteer instance and "results" is a shared
// array; both are defined in the surrounding scope (see the combined sketch below).
const crawlUrl = async (url) => {
  // Open new tab.
  const page = await browser.newPage();
  await page.goto(url);

  // Evaluate code in the context of the page and get your data.
  const result = await page.evaluate(() => {
    return {
      title: document.title,
      url: window.location.href,
    };
  });
  results.push(result);

  // Close it.
  await page.close();
};
Then you need a promise producer. Every time this function is called, it takes one URL from the URLS_TO_BE_CRAWLED array and returns a crawlUrl(url) promise. Once URLS_TO_BE_CRAWLED is empty, it returns null instead, which finishes the pool.
const promiseProducer = () => {
  const url = URLS_TO_BE_CRAWLED.pop();
  return url ? crawlUrl(url) : null;
};
Finally, you execute this with a CONCURRENCY of your choice:
const pool = new PromisePool(promiseProducer, CONCURRENCY);
await pool.start();
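Putting it all together, here is a minimal sketch of how the pieces above can be wired up. The crawlUrl and promiseProducer definitions are repeated so the snippet stands on its own; the URL list and the CONCURRENCY value are just placeholders:

const puppeteer = require('puppeteer');
const PromisePool = require('es6-promise-pool');

const URLS_TO_BE_CRAWLED = [
  'http://www.google.com/',
  'http://www.wikipedia.org/',
];
const CONCURRENCY = 4; // how many pages are processed at the same time
const results = [];

(async () => {
  const browser = await puppeteer.launch();

  // Same crawlUrl as above, repeated here for completeness.
  const crawlUrl = async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const result = await page.evaluate(() => ({
      title: document.title,
      url: window.location.href,
    }));
    results.push(result);
    await page.close();
  };

  const promiseProducer = () => {
    const url = URLS_TO_BE_CRAWLED.pop();
    return url ? crawlUrl(url) : null;
  };

  const pool = new PromisePool(promiseProducer, CONCURRENCY);
  await pool.start();

  await browser.close();
  console.log(results);
})();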
Since this is a very frequently asked question, I also made a working example on our Apify platform: https://www.apify.com/mtrunkat/puppeteer-promise-pool-example
EDIT 12.10.2018
I would also add that we have recently built a whole open-source SDK around concurrent scraping with Puppeteer, which solves the main pain points of this kind of crawling.
Check it out at: https://github.com/apifytech/apify-js