How can I make puppeteer follow multiple links in new page instances, to evaluate them in a concurrent and asynchronous way?
Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js handles the job well too. Puppeteer lets us drive a headless browser programmatically: although it was designed for testing, it allows you to do almost anything you can do manually in a browser, such as visiting pages, clicking links, submitting forms, and taking screenshots.
The trivial solution is to run a loop that takes the next item, drives the browser to interact with the site, and writes the result to some collection. For example, the code starts the browser and then runs the scraper sequentially, each job in a new tab, as in the sketch below.
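A minimal sketch of that sequential approach, assuming a urls array and a placeholder page.evaluate call for the per-page scraping logic (neither is from the original snippet):

const puppeteer = require('puppeteer');

(async () => {
  const urls = ['http://www.google.com/', 'http://www.wikipedia.org/'];
  const results = [];

  const browser = await puppeteer.launch();
  for (const url of urls) {
    const page = await browser.newPage(); // each job gets its own tab
    await page.goto(url);
    // Extract whatever you need from the page; the title is just an example.
    results.push(await page.evaluate(() => document.title));
    await page.close();
  }
  await browser.close();

  console.log(results);
})();

Each iteration waits for the previous one to finish, which is exactly the limitation the answers below address.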
Marek's solution is fine for a few pages, but if you want to crawl a large number of pages concurrently, I recommend looking at my library, puppeteer-cluster.
It runs tasks in parallel (like Marek's solution), but also takes care of error handling, retrying, and a few other things. You can see a minimal example below; the library can also be used in more complex settings.
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // use an incognito browser context per worker
    maxConcurrency: 4, // cluster with four workers
  });

  // Define a task to be executed for your data
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // ...
  });

  // Queue URLs
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // ...

  // Wait for cluster to idle and close it
  await cluster.idle();
  await cluster.close();
})();
Almost every Puppeteer method returns a Promise, so you can use, for example, the es6-promise-pool package (https://www.npmjs.com/package/es6-promise-pool).
First, you need to create an async function that processes one URL:
// "browser" is a previously launched Puppeteer instance and "results" is a shared
// array; both are defined in the surrounding scope (see the combined sketch below).
const crawlUrl = async (url) => {
  // Open new tab.
  const page = await browser.newPage();
  await page.goto(url);

  // Evaluate code in the context of the page and get your data.
  const result = await page.evaluate(() => {
    return {
      title: document.title,
      url: window.location.href,
    };
  });
  results.push(result);

  // Close it.
  await page.close();
};
Then you need a promise producer. Every time this function is called, it takes one URL from the URLS_TO_BE_CRAWLED array and returns a crawlUrl(url) promise. Once URLS_TO_BE_CRAWLED is empty, it returns null instead, which finishes the pool.
const promiseProducer = () => {
  const url = URLS_TO_BE_CRAWLED.pop();
  return url ? crawlUrl(url) : null;
};
Finally, you execute this with a CONCURRENCY of your choice:
const pool = new PromisePool(promiseProducer, CONCURRENCY);
await pool.start();
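Putting it all together, here is a minimal sketch of how the pieces above can be wired up. The crawlUrl and promiseProducer definitions are repeated so the snippet stands on its own; the URL list and the CONCURRENCY value are just placeholders:

const puppeteer = require('puppeteer');
const PromisePool = require('es6-promise-pool');

const URLS_TO_BE_CRAWLED = [
  'http://www.google.com/',
  'http://www.wikipedia.org/',
];
const CONCURRENCY = 4; // how many pages are processed at the same time
const results = [];

(async () => {
  const browser = await puppeteer.launch();

  // Same crawlUrl as above, repeated here for completeness.
  const crawlUrl = async (url) => {
    const page = await browser.newPage();
    await page.goto(url);
    const result = await page.evaluate(() => ({
      title: document.title,
      url: window.location.href,
    }));
    results.push(result);
    await page.close();
  };

  const promiseProducer = () => {
    const url = URLS_TO_BE_CRAWLED.pop();
    return url ? crawlUrl(url) : null;
  };

  const pool = new PromisePool(promiseProducer, CONCURRENCY);
  await pool.start();

  await browser.close();
  console.log(results);
})();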
Since this is a very frequently asked question, I also made a working example on our Apify platform: https://www.apify.com/mtrunkat/puppeteer-promise-pool-example
EDIT 12.10.2018
I would also add that we have recently built a whole open-source SDK around concurrent scraping with Puppeteer, which solves the main pain points of this kind of crawling.
Check it out at: https://github.com/apifytech/apify-js