
Chrome Headless puppeteer too much CPU

I have a scraping algorithm in Node.js with puppeteer that scrapes 5 pages concurrently. When it finishes with one page, it pulls the next URL from a queue and opens it in the same page. The CPU is always at 100%. How can I make puppeteer use less CPU?

This process is running on a DigitalOcean droplet with 4 GB of RAM and 2 vCPUs.

I've launched the puppeteer instance with some args to try to make it lighter, but it made no difference:

  puppeteer.launch({
    args: ['--no-sandbox', '--disable-accelerated-2d-canvas', '--disable-gpu'],
    headless: true,
  });

Are there any other args I can give to make it less CPU hungry?

I've also blocked image loading:

await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType().toUpperCase() === 'IMAGE')
    request.abort();
  else
    request.continue();
});
Pjotr Raskolnikov asked Feb 27 '18 11:02



2 Answers

These are my default args; please test them and tell me whether this runs smoothly. Note that --no-sandbox isn't secure when navigating to untrusted sites, but it's OK if you're testing your own sites or apps. So make sure you know what you're doing.

  const options = {
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process', // <- this one doesn't work on Windows
      '--disable-gpu'
    ],
    headless: true
  }

  return await puppeteer.launch(options)
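
Since `--single-process` is noted above as not working on Windows, one way to keep a single args list across machines is to add that flag only on other platforms. This is a minimal sketch, not a tested recommendation: `buildLaunchOptions` is a hypothetical helper name, and the flags are exactly the ones from the list above.

```javascript
// Flags copied from the args list above; nothing new is introduced.
const BASE_ARGS = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-accelerated-2d-canvas',
  '--no-first-run',
  '--no-zygote',
  '--disable-gpu',
];

// Hypothetical helper: returns launch options for the given platform.
// `platform` defaults to the platform Node.js is running on.
function buildLaunchOptions(platform = process.platform) {
  const args = [...BASE_ARGS];
  // Only add --single-process where it is expected to work.
  if (platform !== 'win32') {
    args.push('--single-process');
  }
  return { args, headless: true };
}
```

It would be used as `await puppeteer.launch(buildLaunchOptions())`.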
Edi Imanto answered Oct 13 '22 18:10


There are a few factors that can play into this. First, check whether the site(s) you're visiting use a lot of CPU. Things like canvas rendering and heavy scripts can easily chew through your CPU.
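
To get a rough idea of which sites are the expensive ones, Puppeteer's `page.metrics()` reports cumulative durations (in seconds) such as `TaskDuration`, `ScriptDuration`, `LayoutDuration` and `RecalcStyleDuration`. The sketch below is only an illustration under assumptions: `summarizeMetrics` and `profileUrl` are hypothetical helper names, and `networkidle2` is just one possible wait condition.

```javascript
// Hypothetical helper: pick the CPU-related fields out of a page.metrics() result.
function summarizeMetrics(metrics) {
  const {
    TaskDuration = 0,
    ScriptDuration = 0,
    LayoutDuration = 0,
    RecalcStyleDuration = 0,
  } = metrics;
  return {
    taskSeconds: TaskDuration,
    scriptSeconds: ScriptDuration,
    layoutSeconds: LayoutDuration + RecalcStyleDuration,
    // Fraction of all task time that was spent executing JavaScript.
    scriptShare: TaskDuration > 0 ? ScriptDuration / TaskDuration : 0,
  };
}

// Hypothetical helper: `page` is a Puppeteer Page, `url` is the site to check.
async function profileUrl(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2' });
  return { url, ...summarizeMetrics(await page.metrics()) };
}
```

Logging the result of `profileUrl` for each URL in the queue should show whether a handful of script-heavy pages account for most of the load.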

If you're using Docker for your deployment, make sure you use dumb-init. There's a nice repo here that goes into why you'd use such a thing, but essentially your process runs as PID 1 inside the container, and PID 1 has some hiccups when it comes to handling termination:

EXPOSE 8080

ENTRYPOINT ["dumb-init", "--"]
CMD ["yarn", "start"]

I've witnessed and fixed several issues like this on browserless.io, where I use Docker to handle deployments, and high CPU usage was one of them.
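
On the Node.js side, a complementary step is to close the browser when the process receives a termination signal, so Chrome is not left running after shutdown. This is a minimal sketch under assumptions, not something taken from the browserless setup: `installShutdownHandlers` is a hypothetical helper name, and the `proc` parameter exists only so the function can be tried with a stand-in object.

```javascript
// Hypothetical helper. `browser` is the object returned by puppeteer.launch();
// `proc` defaults to the real process and can be replaced for testing.
function installShutdownHandlers(browser, proc = process) {
  const shutdown = async () => {
    try {
      await browser.close(); // closes Chromium and all of its pages
    } finally {
      proc.exit(0);
    }
  };
  // SIGTERM is what `docker stop` sends; SIGINT covers Ctrl+C.
  for (const signal of ['SIGTERM', 'SIGINT']) {
    proc.once(signal, shutdown);
  }
  return shutdown;
}
```

With dumb-init as the entrypoint, the signal is forwarded to the Node.js process, which can then run a handler like this one.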

browserless answered Oct 13 '22 19:10