
Concurrent page scraping with Puppeteer

How can I make puppeteer follow multiple links in new page instances, to evaluate them in a concurrent and asynchronous way?

malkomich asked Dec 06 '17


People also ask

Is puppeteer good for scraping?

Puppeteer lets you browse the web with a headless browser programmatically. Although it was designed for testing, it is well suited to web scraping: it allows you to do almost anything that you can do manually in a browser. You can visit pages, click links, submit forms, take screenshots, and much more.
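
For example, here is a minimal sketch of those basics (the URL and output file name are placeholders, not from the question):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Visit a page and take a screenshot of it.
  await page.goto('https://example.com/');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();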

How do you speed up puppeteer scraping?

The trivial solution is to run a loop that gets the next item, runs the browser to interact with the site, and writes the result to some collection. For example, you can start the browser once and then run the scraper code sequentially, each job in a new tab, as in the sketch below.
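
A minimal sketch of that sequential pattern (the URL list is a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const results = [];

  // Run each job sequentially, each in its own tab.
  for (const url of ['https://example.com/', 'https://example.org/']) {
    const page = await browser.newPage();
    await page.goto(url);
    results.push({ url, title: await page.title() });
    await page.close();
  }

  console.log(results);
  await browser.close();
})();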

Is Node JS good for scraping?

Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for web scraping. Even though other languages and frameworks are more popular for web scraping, Node.js can do the job well too.


2 Answers

Marek's solution is fine for a few pages, but if you want to crawl a large number of pages concurrently, I recommend looking at my library puppeteer-cluster.

It runs tasks in parallel (like Marek's solution), but also takes care of error handling, retrying, and some other things. You can see a minimal example below. It's also possible to use the library in more complex settings.

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // share one browser, one incognito context per worker
    maxConcurrency: 4, // cluster with four workers
  });

  // Define a task to be executed for your data
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // ...
  });

  // Queue URLs
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // ...

  // Wait for cluster to idle and close it
  await cluster.idle();
  await cluster.close();
})();
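
To see the error handling and retrying mentioned above in action, the same example can register a handler for failed tasks. This sketch is based on the library's documented taskerror event and retryLimit option; verify against the current API before relying on it:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
    retryLimit: 2, // retry a failing task up to two more times
  });

  // Handle tasks that end in an error.
  cluster.on('taskerror', (err, data) => {
    console.log(`Error crawling ${data}: ${err.message}`);
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
  });

  cluster.queue('http://www.google.com/');

  await cluster.idle();
  await cluster.close();
})();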
Thomas Dondorf answered Sep 30 '22


Almost every Puppeteer method returns a promise, so you can use, for example, the es6-promise-pool package: https://www.npmjs.com/package/es6-promise-pool

First you need to create an async function that processes one URL (it relies on browser and results being defined in the surrounding scope; a combined sketch follows further down):

const crawlUrl = async (url) => {
    // Open new tab.
    const page = await browser.newPage();
    await page.goto(url);

    // Evaluate code in a context of page and get your data.
    const result = await page.evaluate(() => {
        return {
            title: document.title,
            url: window.location.href,
        };
    });
    results.push(result);

    // Close it.
    await page.close();
};

Then you need a promise producer. Every time this function is called, it takes one URL from the URLS_TO_BE_CRAWLED array and returns a crawlUrl(url) promise. Once URLS_TO_BE_CRAWLED is empty, it returns null instead, which finishes the pool.

const promiseProducer = () => {
    const url = URLS_TO_BE_CRAWLED.pop();

    return url ? crawlUrl(url) : null;
};

Finally, you execute this with a CONCURRENCY of your choice:

const PromisePool = require('es6-promise-pool');

const pool = new PromisePool(promiseProducer, CONCURRENCY);
await pool.start();
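
Putting the three pieces together, a complete sketch might look like this (the URL list and concurrency value are placeholders):

const puppeteer = require('puppeteer');
const PromisePool = require('es6-promise-pool');

const URLS_TO_BE_CRAWLED = ['https://example.com/', 'https://example.org/'];
const CONCURRENCY = 2;
const results = [];

(async () => {
    const browser = await puppeteer.launch();

    // Same crawlUrl as above; it closes over browser and results.
    const crawlUrl = async (url) => {
        const page = await browser.newPage();
        await page.goto(url);
        const result = await page.evaluate(() => ({
            title: document.title,
            url: window.location.href,
        }));
        results.push(result);
        await page.close();
    };

    const promiseProducer = () => {
        const url = URLS_TO_BE_CRAWLED.pop();
        return url ? crawlUrl(url) : null;
    };

    const pool = new PromisePool(promiseProducer, CONCURRENCY);
    await pool.start();

    await browser.close();
    console.log(results);
})();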

Since this is a very frequently asked question, I also made a working example on our Apify platform: https://www.apify.com/mtrunkat/puppeteer-promise-pool-example


EDIT 12.10.2018

I would also add that we have recently built a whole open-source SDK around concurrent scraping with Puppeteer. It solves the main pain points, such as:

  • autoscaling concurrency based on CPU and memory
  • retries of failed requests using a request queue
  • rotation of browsers (to switch proxies)

Check it out at: https://github.com/apifytech/apify-js
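
For illustration, a minimal crawler built on that SDK could look like the sketch below, based on its PuppeteerCrawler class as documented around that time (the API may have changed in newer versions):

const Apify = require('apify');

Apify.main(async () => {
    // Queue the URLs to crawl; failed requests are retried from this queue.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.example.com/' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // Called once per request; concurrency autoscales with CPU and memory.
        handlePageFunction: async ({ page, request }) => {
            console.log(`${request.url}: ${await page.title()}`);
        },
    });

    await crawler.run();
});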

Marek Trunkát answered Sep 30 '22