How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.
However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling
html_contents = await page.content()
and saving the results, but that saves a copy without any non-HTML elements.
Is there way to save webpages for offline use with Puppeteer?
Open the three-dot menu on the top right and select More Tools > Save page as. You can also right-click anywhere on the page and select Save as or use the keyboard shortcut Ctrl + S in Windows or Command + S in macOS. Chrome can save the complete web page, including text and media assets, or just the HTML text.
It “provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.” In web scraping, Puppeteer gives our script all the power of a browser engine, allowing us to scrape pages that require Javascript execution (like SPAs), scrape infinite scrolling, dynamic content, and more.
It is currently possible via experimental CDP call 'Page.captureSnapshot'
using MHTML format:
'use strict';
const puppeteer = require('puppeteer');
const fs = require('fs');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://en.wikipedia.org/wiki/MHTML');
const cdp = await page.target().createCDPSession();
const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
fs.writeFileSync('page.mhtml', data);
await browser.close();
} catch (err) {
console.error(err);
}
})();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With