I've created a scraper using Puppeteer and Node.js (Express). The idea is that when the server receives an HTTP request, my app starts scraping the page.
The problem is when my app receives multiple HTTP requests at the same time: the scraping process starts over and over again until no more requests come in. How do I handle only one request at a time and queue the other requests until the first scraping process finishes?
Currently, I've tried node-request-queue with the code below, but no luck.
var express = require("express");
var app = express();
var reload = require("express-reload");
var bodyParser = require("body-parser");
const router = require("./routes");
const RequestQueue = require("node-request-queue");
app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());
var port = process.env.PORT || 8080;
app.use(express.static("public")); // static assets eg css, images, js
let rq = new RequestQueue(1);
rq.on("resolved", res => {})
.on("rejected", err => {})
.on("completed", () => {});
rq.push(app.use("/wa", router));
app.listen(port);
console.log("Magic happens on port " + port);
node-request-queue is created for the request package, which is different from express.
You can accomplish the queue using the simplest promise-queue library, p-queue. It has concurrency support and looks much more readable than any other library. You can easily switch away from promises to a robust queue like bull at a later time.
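For comparison, a minimal bull sketch could look like the following. This is only an illustration of where you might end up later; it assumes a Redis instance running locally and reuses the scrape function shown further down.
// hypothetical bull queue, assuming Redis is available at the default address
const Queue = require("bull");
const scrape = require("./scraper");

const scrapeQueue = new Queue("scrape", "redis://127.0.0.1:6379");

// one handler with the default concurrency of 1, so jobs run one after another
scrapeQueue.process(async job => scrape(job.data.url));

// enqueue a url; the returned job can later be awaited with job.finished()
scrapeQueue.add({ url: "https://example.com" });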
This is how you can create a queue,
const PQueue = require("p-queue");
const queue = new PQueue({ concurrency: 1 });
This is how you can add an async function to the queue; it will return the resolved data if you await it,
queue.add(() => scrape(url));
So instead of adding the route to the queue, you just remove the other lines around it and keep the router as is.
// here goes one route
app.use('/wa', router);
Inside one of your router files,
const routes = require("express").Router();
const PQueue = require("p-queue");
// create a new queue, and pass how many you want to scrape at once
const queue = new PQueue({ concurrency: 1 });
// our scraper function lives outside route to keep things clean
// the dummy function returns the title of provided url
const scrape = require('../scraper');
async function queueScraper(url) {
  return queue.add(() => scrape(url));
}

routes.post("/", async (req, res) => {
  const result = await queueScraper(req.body.url);
  res.status(200).json(result);
});
module.exports = routes;
Make sure to include the queue inside the route, not the other way around. Create only one queue in your routes file, or wherever you are running the scraper.
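If the scraper is used from more than one router file, one way to keep a single queue (a sketch using a hypothetical queue.js module) is to create it once and require it wherever it is needed; Node caches the module, so every require returns the same instance,
// queue.js - hypothetical shared module, so every router reuses the same queue
const PQueue = require("p-queue");

// a single instance with concurrency 1, cached by Node's module system
module.exports = new PQueue({ concurrency: 1 });
and in a router file you would simply do const queue = require("./queue");.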
Here are the contents of the scraper file. You can use any content you want; this is just a working dummy,
const puppeteer = require('puppeteer');
// a dummy scraper function
// launches a browser and gets title
async function scrape(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const title = await page.title();
  await browser.close();
  return title;
}
module.exports = scrape;
Here is my git repo, which has working code with a sample queue.
If you use any such queue, you will notice you have a problem dealing with 100 results at the same time: requests to your API will keep timing out because there are 99 other URLs waiting in the queue. That is why you have to learn more about real queues and concurrency at a later time.
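One stop-gap until then (a sketch, and entirely my own variation on the route above, assuming callers can poll for the result later) is to respond immediately with a job id instead of holding the request open while the scrape waits in the queue,
// hypothetical fire-and-forget variant of the POST route above
let nextId = 0;
const results = new Map(); // in-memory store, only suitable for a demo

routes.post("/", (req, res) => {
  const id = ++nextId;
  queue.add(() => scrape(req.body.url))
    .then(title => results.set(id, title))
    .catch(err => results.set(id, { error: err.message }));
  // reply right away, so the request never waits for the queue to drain
  res.status(202).json({ id });
});

// poll this endpoint to pick up the result once the scrape has finished
routes.get("/:id", (req, res) => {
  res.status(200).json(results.get(req.params.id) || { status: "pending" });
});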
Once you understand how the queue works, the other answers about puppeteer-cluster, RabbitMQ, bull queue, etc. will help you at that time :).
You can use puppeteer-cluster for that (disclaimer: I'm the author). You can set up a cluster with a pool of only one worker, so the jobs given to the cluster will be executed one after another.
As you did not say what your Puppeteer script should be doing, in this code example I'm extracting the title of a page as an example (given via /wa?url=...) and providing the result in the response.
const express = require('express');
const { Cluster } = require('puppeteer-cluster');

const app = express();

(async () => {
  // setup the cluster with only one worker in the pool
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 1,
  });

  // define your task (in this example we extract the title of the given page)
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return await page.evaluate(() => document.title);
  });

  // listen for the request
  app.get('/wa', async function (req, res) {
    // cluster.execute will run the job with the workers in the pool. As there is
    // only one worker in the pool, the jobs will be run sequentially
    const result = await cluster.execute(req.query.url);
    res.end(result);
  });

  app.listen(8080); // same port as in the question
})();
This is a minimal example. You might want to catch any errors in your listener. For more information check out a more complex example with a screenshot server using express in the repository.
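For instance, a minimal way to catch errors in the listener (just a sketch; the 500 response shape is my own choice) would be,
app.get('/wa', async function (req, res) {
  try {
    const result = await cluster.execute(req.query.url);
    res.end(result);
  } catch (err) {
    // navigation failures, timeouts or invalid urls end up here
    res.status(500).end('Scraping failed: ' + err.message);
  }
});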