
Puppeteer: how to download entire web page for offline use

How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.

However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling

const html_contents = await page.content();

and saving the results, but that saves a copy without any non-HTML elements.

Is there a way to save web pages for offline use with Puppeteer?

Coolio2654 asked Feb 21 '19 18:02




1 Answer

This is currently possible through the experimental Chrome DevTools Protocol (CDP) method 'Page.captureSnapshot', which saves the page as a single MHTML file:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();
    // Capture the rendered page as an MHTML archive.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation or the snapshot failed.
    if (browser) await browser.close();
  }
})();
vsemozhebuty answered Sep 20 '22 12:09
