Using Puppeteer, how can I open a page, get the data, then go back to the previous page to get the next page on the list?

SITUATION:

Here is what I want to do:

1) I load page 0. Page 0 contains clickable links to different pages. I want to load the content of all those pages. So:

2) Click on the first link. Load page 1. Get Data. Go back to the previous page (Page 0)

3) Click on the second link, which loads page 2, and so on until all links have been clicked.

With my current code, page 0 loads, then the first link is clicked and loads page 1, then there is a crash with the following error:

(node:2629) UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Execution context was destroyed.

QUESTION:

What am I doing wrong, and how can I make my script behave the way I intended?


CODE:

const puppeteer = require('puppeteer');
const fs = require('fs');

let getData = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('url', { waitUntil: 'networkidle2' });
    await page.setViewport({width: ..., height:...});

    const result = await page.evaluate(async () => {
        let data = []; 
        let elements = document.querySelector('.items').querySelectorAll('.item'); 

        for (const element of elements) {

            element.click();
            await new Promise((resolve) => setTimeout(resolve, 2000));

            // GETTING THE DATA THEN PUSHING IT INTO THE DATA ARRAY

            await page.goBack();
        }

        return data; // Return our data array

    });

    browser.close();
    return result; // Return the data
};
asked Aug 06 '18 11:08 by TheProgrammer


4 Answers

OK, here's my take on this. Firstly, you're using the evaluate method incorrectly: you don't actually need it, and you're asking it to do something it can't do. To explain: the function you pass to evaluate runs in the context of your web page only. It lets you execute JavaScript directly on the current page in the remote browser, and it has no access to variables that you've declared outside that function. So in this case, when you do this:

await page.goBack();

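The evaluate method has no idea what page is or how to use it. Now there are ways to inject page into the evaluate method but that won't resolve your problem either. Puppeteer API calls simply won't work inside an evaluate method (I've tried this myself and it always returns an exception). On top of that, clicking a link from inside evaluate navigates the page away, which destroys the execution context the function was running in. That is where the "Execution context was destroyed" error comes from.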
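To make the context boundary concrete, here is a minimal sketch. The helper name getItemTexts and the '.item' selector in the usage comment are illustrative assumptions, not part of the question. It shows that Node-side values reach the page only as serializable arguments to page.evaluate, while navigation calls stay outside it.

```javascript
// Hypothetical helper: reads the text of every element matching `selector`.
// `page` is assumed to be a Puppeteer Page that has already been opened.
async function getItemTexts(page, selector) {
  // The callback runs inside the browser page, so it can use `document`
  // but cannot see Node-side variables such as `page` or `selector`.
  // `selector` is therefore passed explicitly as a serializable argument.
  return page.evaluate((sel) => {
    return Array.from(document.querySelectorAll(sel), (el) => el.textContent.trim());
  }, selector);
}

// Navigation calls stay on the Node side, outside evaluate, for example:
//   const texts = await getItemTexts(page, '.item');
//   await page.goBack();
```

The design point is that evaluate is used only to read from the DOM, and everything that drives the browser (goto, goBack, click) is called on page from Node.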

So now let's get back to the problem you do have. What you're doing in the evaluate function is retrieving one UI element with class .items and then searching for every UI element with class .item inside it. You're then looping through all of the found UI elements, clicking on each one, grabbing some kind of data and then going back to click on the next one.

You can achieve all of this without ever using the evaluate method and, instead, using Puppeteer API calls as follows:

const itemsList = await page.$('.items'); // '.$' is the Puppeteer equivalent of 'querySelector'
const elements = await itemsList.$$('.item'); // '.$$' is the Puppeteer equivalent of 'querySelectorAll'

const data = [];
// for...of awaits each step in turn; forEach would start every async callback at once
for (const element of elements) {
  await element.click();
  // Get the data you want here and push it into the data array
  await page.goBack();
  // Note: handles in 'elements' may no longer be valid after a navigation
}

Hope this helps you out!

answered Oct 18 '22 05:10 by AJC24


Instead of navigating back and forth to click the next link on the first page, it would make more sense to store the links from the first page in an array and then open them one at a time with page.goto().

In other words, you can accomplish this task using the following example:

await page.goto('https://example.com/page-1');

const urls = await page.evaluate(() => Array.from(document.querySelectorAll('.link'), element => element.href));

for (let i = 0, total_urls = urls.length; i < total_urls; i++) {
  await page.goto(urls[i]);

  // Get the data ...
}
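
As a slightly fuller sketch of the same idea, the loop can be wrapped in a function that also fills in the "Get the data" step. The names scrapeLinkedPages, linkSelector and collect are illustrative assumptions, and the 'networkidle2' wait condition is borrowed from the question rather than required.

```javascript
// Hypothetical helper: gathers link targets first, then visits each one.
// `page` is assumed to be a Puppeteer Page; `linkSelector` and `collect`
// are illustrative parameters supplied by the caller.
async function scrapeLinkedPages(page, listUrl, linkSelector, collect) {
  await page.goto(listUrl, { waitUntil: 'networkidle2' });

  // Plain URL strings survive navigation, unlike element handles.
  const urls = await page.$$eval(linkSelector, (links) => links.map((a) => a.href));

  const results = [];
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });
    // `collect` decides what to read from the page that is now loaded.
    results.push({ url, data: await collect(page) });
  }
  return results;
}

// Possible usage, assuming the links match '.link':
//   const rows = await scrapeLinkedPages(page, 'https://example.com/page-1',
//                                        '.link', (p) => p.title());
```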
answered Oct 18 '22 05:10 by Grant Miller


@AJC24's answer did not work for me. The problem was that the page context was destroyed when clicking in and coming back to the original page.

What I ended up having to do was something similar to what Grant suggested. I collected all of the button identifiers in an array and upon going back to the original page I would click in again.
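
A rough sketch of that click-and-return approach might look like the following. It identifies each item by its index and looks the elements up again after every return to the list. The helper name visitEachItem and the collect callback are illustrative assumptions, and the sketch assumes each click triggers a full page navigation.

```javascript
// Hypothetical helper: clicks each element matching `selector` in turn,
// collects something from the destination page, then returns to the list.
// `page` is assumed to be a Puppeteer Page already showing the list page.
async function visitEachItem(page, selector, collect) {
  const data = [];
  const count = (await page.$$(selector)).length;

  for (let i = 0; i < count; i++) {
    // Re-query on every pass: handles obtained before a navigation
    // belong to a destroyed execution context and cannot be reused.
    const items = await page.$$(selector);
    if (!items[i]) break; // the list changed; stop rather than guess

    // Start waiting for the navigation before clicking to avoid a race.
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      items[i].click(),
    ]);

    data.push(await collect(page));
    await page.goBack({ waitUntil: 'networkidle2' });
  }
  return data;
}

// Possible usage:
//   const titles = await visitEachItem(page, '.items .item', (p) => p.title());
```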

answered Oct 18 '22 03:10 by gemart


Using the loop from @Grant's answer, I still got this error:

Execution context was destroyed, most likely because of a navigation.

Opening a new tab on each iteration solved the problem:

for (let i = 0, total_urls = urls.length; i < total_urls; i++) {
  const page = await browser.newPage(); // Fresh tab for each URL
  await page.goto(urls[i], { waitUntil: 'networkidle0', timeout: 0 });

  // Get the data ...

  await page.close(); // Close the tab so open tabs do not pile up
}
answered Oct 18 '22 04:10 by Hunter