I'm trying to save a webpage for offline usage with Node.js and Puppeteer. I see a lot of examples with:
await page.screenshot({path: 'example.png'});
But with a bigger webpage that's not an option. So a better option in Puppeteer is to load the page and then save it like:
const html = await page.content();
// ... write to file
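For reference, a minimal sketch of that write-to-file step (using fs/promises; 'page.html' is just an example filename):

const fs = require('fs').promises;
// write the serialized DOM to disk
await fs.writeFile('page.html', html);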
Ok, that works. Now I am going to scroll through infinite-scrolling pages like Twitter. So I decided to block all images in the Puppeteer page:
page.on('request', request => {
  if (request.resourceType() === 'image') {
    const imgUrl = request.url();
    download(imgUrl, 'download').then((output) => {
      images.push({url: output.url, filename: output.filename});
    }).catch((err) => {
      console.log(err);
    });
    request.abort();
  } else {
    request.continue();
  }
});
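Note that request interception has to be enabled before this handler runs, otherwise request.abort() / request.continue() will throw; one line before registering the handler takes care of it:

await page.setRequestInterception(true);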
Ok, I now use the npm download lib to download all the images, and yes, the downloaded images are ok :D.
Now when I save the content, I want the source to point to the offline images.
const html = await page.content();
But now I'd like to replace all the
<img src="/pic.png?id=123">
<img src="https://twitter.com/pics/1.png">
And also things like:
<div style="background-image: url('this_also.gif')"></div>
So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?
JavaScript and CSS would also be nice.
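One rough way to handle at least the <img> case is to rewrite the tags in the live DOM before calling page.content(), assuming the images array built above maps each remote URL to a local filename (background-image URLs in CSS would need a similar pass over the stylesheets):

// rewrite <img> tags so they point at the downloaded copies;
// `images` is the array filled in the request handler above
await page.evaluate((images) => {
  document.querySelectorAll('img').forEach((img) => {
    const match = images.find((i) => i.url === img.src);
    if (match) img.setAttribute('src', match.filename);
  });
}, images);
const html = await page.content();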
Update
For now I will open the big HTML file again with Puppeteer, and then intercept all files such as: https://dom.com/img/img.jpg, /file.jpg, ....
request.respond({
  status: 200,
  contentType: 'image/jpeg',
  body: '..'
});
I could also do it with a Chrome extension, but I'd like to have a function with some options, e.g. page.html(), the same as page.pdf().
Let's go back to the first point: you can use fullPage to take the screenshot.
await page.screenshot({path: 'example.png', fullPage: true});
If you really want to download all the resources for offline use, yes you can:
const fse = require('fs-extra');

page.on('response', async (res) => {
  // save each response body to SOMEWHERE_TO_STORE
  await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
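SOMEWHERE_TO_STORE is a placeholder; one way to fill it in is to map each URL to a local file and remember its Content-Type for the replay step later. A sketch (toLocalPath and manifest are my own helper names, not Puppeteer APIs):

const path = require('path');
const crypto = require('crypto');

const manifest = {}; // url -> { file, type }

// derive a flat filename under ./offline from the URL (illustrative only)
const toLocalPath = (url) =>
  path.join('offline', crypto.createHash('md5').update(url).digest('hex'));

page.on('response', async (res) => {
  const url = res.url();
  const file = toLocalPath(url);
  manifest[url] = { file, type: res.headers()['content-type'] };
  await fse.outputFile(file, await res.buffer());
});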
Then you can browse the website offline through Puppeteer with everything working.
await page.setRequestInterception(true);

page.on('request', async (req) => {
  // handle the request by responding with the data you stored in SOMEWHERE_TO_STORE
  // and of course, don't forget THE_FILE_TYPE
  req.respond({
    status: 200,
    contentType: THE_FILE_TYPE,
    body: await fse.readFile(SOMEWHERE_TO_STORE),
  });
});
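If you went with the manifest sketch from above, the replay handler could look roughly like this (again just a sketch, not the only way to do it):

page.on('request', async (req) => {
  const entry = manifest[req.url()];
  if (entry) {
    req.respond({
      status: 200,
      contentType: entry.type,
      body: await fse.readFile(entry.file),
    });
  } else {
    req.abort(); // nothing was stored for this URL, so fail it while offline
  }
});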