
Puppeteer doesn't return all requests from a particular website

I have the following code that extracts all requests made by a particular website (images, CSS, scripts, fonts, ...):

const puppeteer = require('puppeteer');

var totalRequests = 0;

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', interceptedRequest => {
        interceptedRequest.continue();
    });
    page.on('response', response => {
        totalRequests = totalRequests + 1;
        console.log('Url: ' + response.url());
    });
    await page.goto('https://stackoverflow.com');
    await browser.close().then(() => {
        // `res` comes from the surrounding HTTP handler, which is not shown here
        res.send('Requests: ' + totalRequests);
    });
});

This works: in the console I can see the URLs requested by stackoverflow.com (CSS, image, font and JavaScript files) and the total number of files requested (31 in this case). However, I noticed that this code does not return all of the page's requests.

If we open https://stackoverflow.com in Google Chrome, press F12, go to the Network tab and reload the page, we see a total of 39-40 requests.

The problem is that my code only returns 30 to 31 requests, and the console does not show all the URLs that Chrome shows. What might be happening? And what can I do to get all the requests that Google Chrome shows?

Sudo Sur asked Sep 20 '25 10:09


1 Answer

Main issue

For stackoverflow.com, the number of loaded resources depends on the size of your browser window. Stack Overflow shows ads in the right sidebar, but the corresponding resources (JavaScript, images, etc.) are only loaded if your viewport is wide enough to display that sidebar. Try this out yourself by reducing the width of your window until the right sidebar is hidden and then reloading the page. The DevTools will show you a different number of loaded resources.

Example code

The following example shows how to simulate a larger browser window by setting the defaultViewport option of puppeteer.launch (Puppeteer's default viewport is only 800x600). Note that I'm using async/await syntax throughout this example and that I removed the page.setRequestInterception call, as the response event is also triggered without it (you only need it if you really want to modify the request or response).

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        // headless: false, devtools: true, // uncomment to watch the page load
        defaultViewport: { width: 1600, height: 800 }
    });
    const page = await browser.newPage();

    // Count every response the page receives
    let totalRequests = 0;
    page.on('response', () => {
        totalRequests = totalRequests + 1;
    });

    await page.goto('https://stackoverflow.com');
    console.log(totalRequests);

    await browser.close();
})();

This returns 30 for me, which is still not the number we are expecting (~40).

Waiting until all resources are loaded

There is another problem with your code. Let's open the DevTools to see what is happening. If you look at the waterfall diagram in your network tab, it looks like this:

[Image: waterfall diagram in Chrome DevTools]

See that red line? This is the load event. By default, page.goto waits for this event. But in our case there are actually a few files that are loaded after the event has fired (the files to the right of the red line). To also wait for these resources to load, we can use one of the options of the page.goto function. With waitUntil: 'networkidle0', the script waits until there have been no open network connections for at least 500 ms.

So if you switch out the page.goto line from the top with this line, you should see the expected number of requests:

await page.goto('https://stackoverflow.com', { waitUntil: 'networkidle0' });

When using this setting, the code from above returned 39 for me, which is what you are expecting.
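To see which URLs account for the difference, one option is to load the page twice, once with each waitUntil setting, and print the URLs that only show up in the 'networkidle0' run. The following is a minimal sketch, not a definitive implementation: the helper names onlyIn and collectResponseUrls are made up for this illustration, and it assumes the script is started with the target URL as its first command-line argument. Because the two runs are separate page loads, the lists can also differ for unrelated reasons (for example, different ads being served).

```javascript
// Pure helper (hypothetical name): unique URLs present in `later`
// but missing from `earlier`.
function onlyIn(later, earlier) {
    const seen = new Set(earlier);
    return [...new Set(later)].filter(url => !seen.has(url));
}

// Hypothetical helper: loads `url` in a fresh browser and returns the URL of
// every response that arrived before page.goto resolved.
async function collectResponseUrls(url, waitUntil) {
    // Required lazily so that onlyIn can be used without Puppeteer installed.
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch({
        defaultViewport: { width: 1600, height: 800 }
    });
    try {
        const page = await browser.newPage();
        const urls = [];
        page.on('response', response => urls.push(response.url()));
        await page.goto(url, { waitUntil });
        return urls;
    } finally {
        // Close the browser even if navigation fails.
        await browser.close();
    }
}

async function main(target) {
    const atLoad = await collectResponseUrls(target, 'load');
    const atIdle = await collectResponseUrls(target, 'networkidle0');
    console.log('load: ' + atLoad.length + ', networkidle0: ' + atIdle.length);
    for (const url of onlyIn(atIdle, atLoad)) {
        console.log('Only seen with networkidle0: ' + url);
    }
}

// Assumption: the target URL is passed as the first command-line argument.
// Without an argument the script only defines the helpers and exits.
if (process.argv[2]) {
    main(process.argv[2]).catch(err => {
        console.error(err);
        process.exitCode = 1;
    });
}
```

Saved as, say, compare.js (the file name is arbitrary), it can be run with: node compare.js https://stackoverflow.com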

Thomas Dondorf answered Sep 23 '25 01:09