I am using Puppeteer to build a basic web scraper, and so far I can return all the data I require from any given page. However, when pagination is involved my scraper comes unstuck and only returns the first page.
See the example below: it returns the title/price for the first 20 books, but doesn't look at the other 49 pages of books.
Just looking for guidance on how to overcome this; I can't see anything in the docs.
Thanks!
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    const result = await page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements){
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({title, price});
        }
        return data;
    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value);
});
To be clear, I am following a tutorial; this code comes from Brandon Morelli on codeburst.io: https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
I was following the same article in order to educate myself on how to use Puppeteer. The short answer to your question is that you need to introduce one more loop to iterate over all available pages in the online book catalogue. I took the following steps to collect all book titles and prices:

- introduced a loop that iterates over the catalogue pages and clicks the "next" button at the end of each iteration;
- extracted the page.evaluate part into a separate async function that takes page as an argument.

Same exact code from Brandon Morelli's article, but now with one extra loop:
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    var results = []; // variable to hold the collection of all book titles and prices
    var lastPageNumber = 50; // hardcoded last catalogue page; you can set it dynamically if you wish (see the sketch after the console output)

    // simple loop to iterate over the catalogue pages
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 1 sec for the page to load
        // (newer Puppeteer versions replaced page.waitFor with page.waitForTimeout)
        await page.waitFor(1000);

        // await extractedEvaluateCall and concatenate the results on every iteration;
        // you could use results.push instead, but you would end up with a collection of collections
        results = results.concat(await extractedEvaluateCall(page));

        // click the "next" button on the page to jump to the next catalogue page
        if (index != lastPageNumber - 1) {
            // there is no "next" button on the last page
            await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
        }
    }

    await browser.close();
    return results;
};

async function extractedEvaluateCall(page) {
    // same exact extraction logic, just moved into a separate function;
    // it has to be async and take page as an argument
    return page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({ title, price });
        }
        return data;
    });
}

scrape().then((value) => {
    console.log(value);
    console.log('Collection length: ' + value.length);
    console.log(value[0]);
    console.log(value[value.length - 1]);
});
Console output:
...
{ title: 'In the Country We ...', price: '£22.00' },
... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }
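For completeness: instead of hardcoding lastPageNumber, you can keep scraping until the "next" link is no longer present on the page. Below is a minimal sketch of that approach, not from the article; it assumes the books.toscrape.com markup, where each .product_pod holds the full title in the title attribute of its h3 a link and the price in a .price_color element, and it waits for the actual navigation instead of a fixed 1-second delay:

const puppeteer = require('puppeteer');

let scrapeAllPages = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('http://books.toscrape.com/');

    let results = [];
    while (true) {
        // extract titles/prices with explicit selectors instead of childNodes indices
        results = results.concat(await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product_pod'), element => ({
                title: element.querySelector('h3 a').getAttribute('title'),
                price: element.querySelector('.price_color').innerText
            }));
        }));

        // stop when there is no "next" link, i.e. we are on the last catalogue page
        const nextLink = await page.$('li.next > a');
        if (nextLink === null) break;

        // click "next" and wait for the navigation to finish
        await Promise.all([
            page.waitForNavigation(),
            nextLink.click()
        ]);
    }

    await browser.close();
    return results;
};

scrapeAllPages().then((value) => {
    console.log('Collection length: ' + value.length);
});

Note that reading the title attribute returns the untruncated titles, so the values will differ slightly from the truncated innerText output shown above.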