
Managing puppeteer for memory and performance

Tags:

node.js

web-scraping

puppeteer

I'm using Puppeteer to scrape some pages, and I'm curious how to manage it in production for a Node app. I'll be scraping up to 500,000 pages a day, but the scrape jobs will arrive at random intervals, so it's not a single queue that I can plow through.

Is it better to open a browser, go to the page, and then close the browser after each job? I assume that would be a lot slower, but it might handle memory better.

Or should I open one global browser when the app boots, go to each page as needed, and have some way to dump the page when I'm done with it (e.g. closing all tabs in Chrome, but not closing Chrome itself), then open a new page the next time I need one? This seems like it would be faster, but it could potentially eat up a lot of memory.

I've never worked with this library, especially in a production environment, so I'm not sure if there are things I should watch out for.


asked Aug 22 '18 16:08

jeremywoertink

1 Answer

You probably want to create a pool of multiple Chromium instances with independent browsers. The advantage is that when one browser crashes, all other jobs can keep running. The advantage of a single browser (with multiple pages) is slightly lower memory and CPU usage, and the cookies are shared between your pages.
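To make the trade-off concrete, here is a rough sketch of what a hand-rolled pool could look like. It is only an illustration under assumptions, not how any particular library is implemented: the class name `BrowserPool`, the options `size` and `maxJobsPerBrowser`, and the injected `launch` factory (for example `() => puppeteer.launch()`) are all hypothetical. Each slot runs one job at a time in its own browser, and a browser is replaced after a failed job or after a fixed number of jobs, which sits between "close the browser after every job" and "one global browser forever".

```javascript
// Hand-rolled pool sketch: one job per browser at a time.
// `launch` is injected (for example () => puppeteer.launch()), so this
// file does not import Puppeteer itself. All names here are hypothetical.
class BrowserPool {
  constructor({ launch, size = 4, maxJobsPerBrowser = 100 }) {
    this.launch = launch;
    this.size = size;
    this.maxJobsPerBrowser = maxJobsPerBrowser;
    this.idle = [];     // slots that are free to take a job
    this.waiting = [];  // resolvers of jobs waiting for a free slot
    this.created = 0;   // number of slots handed out so far
  }

  // Resolve with a free slot, creating one lazily up to `size`.
  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.size) {
      this.created += 1;
      return { browser: null, jobs: 0 };
    }
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Hand the slot to the next waiting job, or park it as idle.
  release(slot) {
    const next = this.waiting.shift();
    if (next) next(slot);
    else this.idle.push(slot);
  }

  // Close the slot's browser (if any) so the next job launches a fresh one.
  async discard(slot) {
    const old = slot.browser;
    slot.browser = null;
    slot.jobs = 0;
    if (old) await old.close().catch(() => {}); // it may already be dead
  }

  // Run `job(page)` in a fresh tab and always give the slot back.
  async run(job) {
    const slot = await this.acquire();
    let failed = false;
    try {
      if (!slot.browser) slot.browser = await this.launch();
      const page = await slot.browser.newPage();
      try {
        return await job(page);
      } finally {
        await page.close().catch(() => {}); // the tab may already be gone
      }
    } catch (err) {
      failed = true;
      throw err;
    } finally {
      slot.jobs += 1;
      // Recycle after a failure (simplification: any error counts) or
      // after enough jobs, to bound per-browser memory growth.
      if (failed || slot.jobs >= this.maxJobsPerBrowser) await this.discard(slot);
      this.release(slot);
    }
  }
}
```

A job would then be submitted with something like `pool.run(async (page) => { await page.goto(url); return page.title(); })`. Treating every job error as a reason to restart the browser is deliberately conservative; a real setup would probably distinguish a crashed browser from an ordinary navigation error.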

Pool of puppeteer instances

The library puppeteer-cluster (disclaimer: I'm the author) creates a pool of browsers or pages for you. It takes care of creation, error handling, browser restarting, etc., so you can simply queue jobs/URLs and the library handles everything else.

Code sample

    const { Cluster } = require('puppeteer-cluster');

    (async () => {
        const cluster = await Cluster.launch({
            concurrency: Cluster.CONCURRENCY_BROWSER, // use one browser per worker
            maxConcurrency: 4, // cluster with four workers
        });

        // Define a task to be executed for your data (put your "crawling code" in here)
        await cluster.task(async ({ page, data: url }) => {
            await page.goto(url);
            // ...
        });

        // Queue URLs when the cluster is created
        cluster.queue('http://www.google.com/');
        cluster.queue('http://www.wikipedia.org/');

        // Or queue URLs anytime later
        setTimeout(() => {
            cluster.queue('http://...');
        }, 1000);
    })();
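For a long-running production process, it can help to see when jobs fail. The sketch below is one possible way to wire that up, assuming the `retryLimit`, `retryDelay` and `timeout` launch options and the `taskerror` event described in the puppeteer-cluster README. The helper names `buildClusterOptions` and `logTaskErrors` and the concrete values are made up for illustration. The `Cluster` class and the cluster instance are passed in as arguments, so the snippet itself does not need the library installed.

```javascript
// Illustrative launch options; the numbers are placeholders, not recommendations.
function buildClusterOptions(Cluster) {
  return {
    concurrency: Cluster.CONCURRENCY_BROWSER, // one browser per worker
    maxConcurrency: 4,                        // four workers
    retryLimit: 2,                            // retry a failed job up to two times
    retryDelay: 1000,                         // wait 1 s before retrying
    timeout: 30000,                           // give up on a job after 30 s
  };
}

// Log every failed job; `log` is injectable so it can go to any logger.
function logTaskErrors(cluster, log = console.error) {
  cluster.on('taskerror', (err, data, willRetry) => {
    const outcome = willRetry ? 'will retry' : 'giving up';
    log(`Job failed for ${data}: ${err.message} (${outcome})`);
  });
}
```

With the sample above, this would be used as `const cluster = await Cluster.launch(buildClusterOptions(Cluster)); logTaskErrors(cluster);`.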

You can also queue functions directly in case you have different tasks to do. Normally you would close the cluster after you are finished via cluster.close(), but you are free to just let it stay open. The repository contains another example of a cluster that gets data when a request comes in.
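As a rough sketch of those two ideas, the helpers below pass a task function along with each URL and use `cluster.execute()` to get a result back, which is the shape a request handler would need. It assumes `cluster` was created with `Cluster.launch()` as in the sample above; the helper and task names (`scrapeOnDemand`, `enqueue`, `shutdown`, `getTitle`, `countLinks`) are hypothetical and not part of the library.

```javascript
// Task functions receive { page, data } from the cluster.
const getTitle = async ({ page, data: url }) => {
  await page.goto(url);
  return page.title();
};

const countLinks = async ({ page, data: url }) => {
  await page.goto(url);
  const links = await page.$$('a'); // all anchor elements on the page
  return links.length;
};

// For request handlers: resolves with whatever the task function returns.
function scrapeOnDemand(cluster, url, taskFn = getTitle) {
  return cluster.execute(url, taskFn);
}

// Fire-and-forget: queue a URL together with its own task function.
function enqueue(cluster, url, taskFn) {
  cluster.queue(url, taskFn);
}

// Graceful shutdown: wait for queued jobs to finish, then close the browsers.
async function shutdown(cluster) {
  await cluster.idle();
  await cluster.close();
}
```

In a web server, a route handler could `await scrapeOnDemand(cluster, someUrl)` and send the result back, while `shutdown(cluster)` would only be called when the process is stopping.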

answered Sep 19 '22 08:09

Thomas Dondorf
