
Can I have multiple Puppeteer browsers open?

I'm using node-cron (which allows you to run cron scripts inside of your node program) to run some puppeteer scraping. The scripts will sometimes run at the same time, meaning there will be multiple instances of the browser, const browser = await puppeteer.launch(), open at once.

Is this bad practice? If so, is there an alternative way of writing this code that won't make it fail?

Thanks for your help.

cron.schedule('*/15 * * * *', async () => {    
    const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox']});
    const page = await browser.newPage(); // Open a new page (tab) in this browser
    let today = moment();        
    logger.info(`Chrome Launched...`);

    try {
        await senatorBot(users, page, today.format("YYYY-DD-MM"));
    } catch(err) {
        logger.debug(JSON.stringify(err));
    }

    try {
        await senateCandidateBot(users, page, today.format("YYYY-DD-MM")); // This sequence matters, because agree statement will not be present...
    } catch(err) {
        logger.debug(JSON.stringify(err));
    }

    await page.close();
    await browser.close();
    logger.info(`Chrome Closed.`);

});

cron.schedule('*/15 17-19 * * 1-5', async () => {   

    logger.info(`Chrome Launched...`); 
    const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox']});
    const page = await browser.newPage(); // Open a new page (tab) in this browser
    let today = moment();

    try {
        await contractBot(users, page, today.format("MM-DD-YYYY"));
    } catch(err) {
        logger.debug(JSON.stringify(err));
    }

    await page.close();
    await browser.close();
    logger.info(`Chrome Closed.`);
});
Harrison Cramer asked Apr 06 '19 22:04


2 Answers

In general, it is no problem to open two browsers in parallel as long as you have a powerful enough machine. So the answer to this depends entirely on the resources of your machine. Do you have enough memory and CPU to power multiple opened Chrome browsers?

Check if you have enough resources

If you are using Linux, open a tool like htop to check how much memory and CPU is used while the tasks run. If you are hitting your CPU or memory limits, consider running the tasks sequentially (see below).
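If you would rather check from inside the Node process than watch htop, a rough sketch using only the built-in os module could look like this. The function names and the thresholds (512 MB free, 0.8 load per core) are assumptions for illustration, not recommendations.

```javascript
const os = require('os');

// Snapshot of memory and CPU load, roughly what htop shows at a glance.
function resourceSnapshot() {
  const cpuCount = os.cpus().length || 1;
  return {
    freeMemMB: Math.round(os.freemem() / 1024 / 1024),
    totalMemMB: Math.round(os.totalmem() / 1024 / 1024),
    // 1-minute load average divided by core count; always 0 on Windows.
    loadPerCpu: os.loadavg()[0] / cpuCount,
  };
}

// Hypothetical thresholds: tune them for your machine and workload.
function hasHeadroom(snapshot, { minFreeMemMB = 512, maxLoadPerCpu = 0.8 } = {}) {
  return snapshot.freeMemMB >= minFreeMemMB && snapshot.loadPerCpu <= maxLoadPerCpu;
}

const snapshot = resourceSnapshot();
console.log(snapshot, hasHeadroom(snapshot) ? 'ok to launch' : 'consider running sequentially');
```

Logging the snapshot at the start and end of each cron callback gives a quick picture of what one browser costs on your machine.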

Using a pool of resources

Even if you have enough resources you could use the library puppeteer-cluster (disclaimer: I'm the author) to take care of the concurrency handling. The library will also take care of error handling (what if a browser crashes?) and can show you memory, CPU usage and crawling statistics during the run.

Code sample

Here is a minimal example of how you could use it.

const { Cluster } = require('puppeteer-cluster');
const cron = require('node-cron');

async function task1({ page }) { // your first task; the cluster provides the page
    await page.goto('...');
    // ...
}

async function task2({ page }) { // another task
    await page.goto('...');
    // ...
}

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER, // each worker uses its own browser
        maxConcurrency: 2, // how many tasks may run at the same time
    });

    cron.schedule('...', () => {
        cluster.queue(task1);
    });

    cron.schedule('...', () => {
        cluster.queue(task2);
    });
})();

Crawl sequentially

If your machine does not have the resources to run two browsers at once, you can run the tasks one after another instead. You only have to set maxConcurrency to 1. The queued tasks will then run sequentially rather than in parallel, because only one resource is open at a time.
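To illustrate what running sequentially means, here is a plain-JavaScript sketch of a promise chain in which each task starts only after the previous one has settled. This is not how puppeteer-cluster is implemented. The names createSerialQueue and enqueue are made up for the example, and the tasks are stand-ins for real scraping functions.

```javascript
// Returns an enqueue function; each task starts only after the previous one settles.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(task) {
    const run = tail.then(() => task());
    // Keep the chain alive even if this task rejects, so later tasks still run.
    tail = run.catch(() => {});
    return run; // callers can still await or catch the individual result
  };
}

// Usage with stand-in tasks (real ones would launch a browser and scrape).
const enqueue = createSerialQueue();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const order = [];

const done = Promise.all([
  enqueue(async () => { order.push('a:start'); await sleep(20); order.push('a:end'); }),
  enqueue(async () => { order.push('b:start'); await sleep(5); order.push('b:end'); }),
]).then(() => console.log(order.join(' '))); // prints: a:start a:end b:start b:end
```

Both cron callbacks could call the same enqueue function, so an overlapping schedule waits its turn instead of opening a second browser.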

Thomas Dondorf answered Sep 25 '22 23:09


OK, I found the bits of code to get you started. The code is tied into my custom code base, but the functions I'm using can easily be replaced by your own.

First, I write a simple Node file that creates an instance of Chromium and saves its wsEndpoint, which we can later use to connect to it.

file: chromiumLauncher.js

const writeText = require("mylib/core.io.file/write-text");
const puppeteer = require("puppeteer");
const path = require("path");
const common = require("./common");

(async () => {
  const launch_options = {
    args: ['--disable-features=site-per-process'],
    headless: false,
    devtools: false,
    defaultViewport: {width: 1200, height: 1000},
    userDataDir: common.userDataDir
  };
  const browser = await puppeteer.launch(launch_options);
  const wsEndpoint = browser.wsEndpoint();
  await writeText(common.fnSettings, JSON.stringify({wsEndpoint}, null, "  "));
})();

In the above, common.js is just where I store some simple config settings, and you can replace it with your own. It only holds two paths: where Puppeteer places its data files, and where to save the wsEndpoint value. write-text is a simple promise-based function for writing text files, basically fs.writeFile with the encoding set to utf-8.
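If you don't have equivalents of those two helpers, a minimal sketch of stand-ins could look like this. The property names userDataDir and fnSettings match the launcher above, but the paths under the temp directory and the placeholder endpoint are assumptions for illustration only.

```javascript
const fs = require('fs/promises');
const os = require('os');
const path = require('path');

// Stand-in for common.js: hypothetical locations, adjust to your project.
const common = {
  userDataDir: path.join(os.tmpdir(), 'chromium-user-data'),
  fnSettings: path.join(os.tmpdir(), 'chromium-settings.json'),
};

// Stand-in for write-text: fs.writeFile with the encoding fixed to utf-8.
async function writeText(fileName, text) {
  await fs.writeFile(fileName, text, 'utf-8');
}

// Example: persist an endpoint the same way chromiumLauncher.js does.
// The endpoint value here is a placeholder, not a real browser endpoint.
const saved = writeText(
  common.fnSettings,
  JSON.stringify({ wsEndpoint: 'ws://127.0.0.1:0/devtools/browser/placeholder' }, null, '  ')
);
```

In a real project the common object would live in its own common.js and be exported with module.exports, so the launcher and the connect file share the same paths.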

Next, we create another JS file called connect.js:

const puppeteer = require("puppeteer");
const cp = require('child_process');
const delay = require("mylib/promise/delay");
let browser = null;

const readText = require("mylib/core.io.file/read-text");
const common = require("./common");


async function launch () {
  cp.spawn('node', ['chromiumLauncher.js'], {
    detached: true,
    shell: true,
    cwd: __dirname
  });
  await delay(5000); //lets wait 5 seconds
}

async function getSettings() {
  try {
    const settingsTxt = await readText(common.fnSettings);
    return JSON.parse(settingsTxt);
  } catch (e) {
    if (e.code !== 'ENOENT') throw e;
    return null;
  }
}


async function connect () {
  if (browser) return browser;
  let settings = await getSettings();
  if (!settings) {
    await launch();
    settings = await getSettings();
  }
  try {
    browser = await puppeteer.connect({browserWSEndpoint: settings.wsEndpoint});
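  } catch (e) {
    const err = e.error || e;
    // Rethrow anything other than a refused connection instead of silently returning null.
    if (err.code !== "ECONNREFUSED") throw e;
    console.log("connection refused, relaunching Chromium");
    await launch();
    settings = await getSettings();
    browser = await puppeteer.connect({browserWSEndpoint: settings.wsEndpoint});
  }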
  return browser;
}


module.exports = connect;

Again, there are a couple of custom library functions in the above, but they should be simple to replace. read-text is just the opposite of write-text, and delay is a simple promise-based delay.

And that's it. To use it:
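As a rough sketch, stand-ins for those two helpers need nothing beyond Node's built-in modules. These are my guesses at equivalent behavior, not the original library code.

```javascript
const fs = require('fs/promises');

// Stand-in for read-text: resolves with the file's contents as a UTF-8 string.
// A missing file rejects with code 'ENOENT', which getSettings() relies on.
function readText(fileName) {
  return fs.readFile(fileName, 'utf-8');
}

// Stand-in for delay: resolves after roughly `ms` milliseconds.
function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```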

const connect = require("path-to/connect");
const browser = await connect();
const page = await browser.newPage();

Because Chromium is started detached, it stays open while processes connect and exit. I've had about 7 processes connected with 70 web pages open in Chromium without any issues. One thing to note: because Chromium runs inside a detached spawn, you have to close it manually when you need to. Another option is to start chromiumLauncher.js under a process manager such as PM2: https://www.npmjs.com/package/pm2
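For closing the detached Chromium without hunting for the process, one possible sketch is a small helper that connects to the saved endpoint and calls browser.close(). The name closeSharedBrowser is hypothetical, and treating ECONNREFUSED as "already closed" is an assumption carried over from the connect file above. Puppeteer is passed in as an argument so the helper uses whichever copy your project has installed.

```javascript
const fs = require('fs/promises');

// `puppeteer` is passed in so the function works with whichever copy you have installed.
// `settingsFile` is the JSON file written by chromiumLauncher.js.
async function closeSharedBrowser(puppeteer, settingsFile) {
  let settings;
  try {
    settings = JSON.parse(await fs.readFile(settingsFile, 'utf-8'));
  } catch (e) {
    if (e.code === 'ENOENT') return false; // nothing was ever launched
    throw e;
  }
  let browser;
  try {
    browser = await puppeteer.connect({ browserWSEndpoint: settings.wsEndpoint });
  } catch (e) {
    // Assumed: a refused connection means Chromium is already gone.
    if ((e.error || e).code !== 'ECONNREFUSED') throw e;
    await fs.rm(settingsFile, { force: true });
    return false;
  }
  await browser.close(); // close() shuts Chromium down; disconnect() would leave it running
  await fs.rm(settingsFile, { force: true }); // drop the stale endpoint
  return true;
}
```

It could be called as closeSharedBrowser(require("puppeteer"), common.fnSettings) from a one-off script. It resolves to true only when a running browser was actually closed.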

Keith answered Sep 26 '22 23:09