I have a scraping algorithm in Node.js with Puppeteer that scrapes 5 pages concurrently; when it finishes with one page, it pulls the next URL from a queue and opens it in the same tab. The CPU is always at 100%. How can I make Puppeteer use less CPU?
This process is running on a DigitalOcean droplet with 4 GB of RAM and 2 vCPUs.
I've launched the Puppeteer instance with some args to try to make it lighter, but nothing changed:
puppeteer.launch({
  args: ['--no-sandbox', '--disable-accelerated-2d-canvas', '--disable-gpu'],
  headless: true,
});
Are there any other args I can give to make it less CPU hungry?
I've also blocked image loading:
await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType() === 'image')
    request.abort();
  else
    request.continue();
});
These are my default args; please test them and tell me if it runs smoothly.
Please note that --no-sandbox
isn't secure when navigating to vulnerable sites, but it's OK if you're testing your own sites or apps. So make sure you know what you're doing.
const options = {
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-accelerated-2d-canvas',
    '--no-first-run',
    '--no-zygote',
    '--single-process', // <- this one doesn't work on Windows
    '--disable-gpu'
  ],
  headless: true
}
return await puppeteer.launch(options)
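Beyond launch flags, the biggest CPU lever on a 2 vCPU droplet is usually lowering concurrency. Here's a minimal sketch of the worker-pool pattern the question describes (N workers pulling from a shared URL queue); `handler` is a hypothetical per-URL scraping function you'd supply, and the concurrency of 2 is just a suggestion to match the vCPU count:

```javascript
// Sketch: drain a URL queue with a fixed number of workers.
// `concurrency` should roughly match your vCPU count;
// `handler` is a hypothetical async function that scrapes one URL.
async function runQueue(urls, concurrency, handler) {
  const queue = [...urls];
  const results = [];

  async function worker() {
    // Node's event loop is single-threaded, so shift() is safe here.
    while (queue.length > 0) {
      const url = queue.shift();
      results.push(await handler(url));
    }
  }

  // Start N workers that all pull from the same queue.
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}
```

With Puppeteer you would open `concurrency` pages once up front and give each worker its own page to reuse, rather than opening a new page per URL.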
There are a few factors that can play into this. First, check whether the site(s) you're visiting use a lot of CPU. Things like canvas and heavy scripts can easily chew through your CPU.
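If that's the case, you can extend the question's image-blocking approach to abort other CPU-heavy resource types. A minimal sketch, assuming a set of types you've decided are safe to drop for your target sites (blocking 'script' would save the most CPU but breaks pages that render content with JavaScript, so it's omitted here):

```javascript
// Sketch: abort more than just images. The blocked set below is an
// assumption — tune it per site you scrape.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

function shouldAbort(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

async function setupBlocking(page) {
  await page.setRequestInterception(true);
  page.on('request', request => {
    // resourceType() returns lowercase strings like 'image' or 'script'.
    if (shouldAbort(request.resourceType())) request.abort();
    else request.continue();
  });
}
```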
If you're using Docker for your deployment, then make sure you use dumb-init
. There's a nice repo here that goes into why you'd use such a thing, but essentially the process that gets PID 1 in your Docker image has some hiccups when it comes to handling termination:
EXPOSE 8080
ENTRYPOINT ["dumb-init", "--"]
CMD ["yarn", "start"]
This is something I've witnessed and fixed on browserless.io, as I use Docker to handle deployments, and CPU usage was one of the issues.