Puppeteer: how to download entire web page for offline use

How can I save an entire web page, with all of its CSS, JavaScript, and media intact (not just its HTML), using Google's Puppeteer? It has worked well for me on other scraping jobs, so I would expect it to be able to do this.

However, after looking through the many excellent examples online, I have found no obvious method for doing so. The closest I have found is calling

const html_contents = await page.content();

and saving the result, but that saves only the HTML, without the CSS, JavaScript, or media the page references.

Is there a way to save web pages for offline use with Puppeteer?

208

asked Feb 21 '19 18:02

Coolio2654

1 Answer

This is currently possible via the experimental CDP (Chrome DevTools Protocol) call 'Page.captureSnapshot', which saves the page and its resources as a single MHTML file:

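'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    // Reuse the blank tab that the browser opens on launch.
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();
    // The snapshot comes back as one MHTML string (HTML plus embedded resources).
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation or the snapshot failed.
    if (browser) await browser.close();
  }
})();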
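
To check that the snapshot really contains the stylesheets and media rather than just the HTML, one option is to list the parts embedded in the MHTML string. The following is a minimal sketch, not part of Puppeteer: `listMhtmlParts` is a hypothetical helper, and it assumes the snapshot is a standard MIME multipart document in which each part carries unfolded `Content-Type` and `Content-Location` headers.

```javascript
'use strict';

// Hypothetical helper (not a Puppeteer API): lists the resources embedded
// in an MHTML snapshot string by reading the headers of each MIME part.
function listMhtmlParts(mhtml) {
  // The top-level header declares the multipart boundary,
  // e.g. boundary="----MultipartBoundary--...".
  const match = /boundary="?([^"\r\n;]+)"?/i.exec(mhtml);
  if (!match) return [];
  const delimiter = '--' + match[1];

  return mhtml
    .split(delimiter)
    .slice(1) // drop the top-level headers before the first boundary
    .map((part) => {
      // Within each part, the headers end at the first blank line.
      const headerBlock = part.split(/\r?\n\r?\n/)[0];
      const type = /^Content-Type:\s*([^;\r\n]+)/im.exec(headerBlock);
      const location = /^Content-Location:\s*(.+?)\s*$/im.exec(headerBlock);
      // Pieces without both headers (such as the closing delimiter) are skipped.
      return type && location
        ? { type: type[1].trim(), location: location[1] }
        : null;
    })
    .filter(Boolean);
}
```

With the `data` string from the snippet above, `console.table(listMhtmlParts(data))` should show one row per embedded resource, so entries such as `text/css` or `image/png` indicate that the non-HTML assets were captured.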

164

answered Sep 20 '22 12:09

vsemozhebuty

Tags: javascript, html, css, web-scraping, puppeteer
