I'm trying to save a webpage for offline usage with Node.js and Puppeteer. I see a lot of examples with:
await page.screenshot({path: 'example.png'});
But with a bigger webpage that's not an option. So a better option in Puppeteer is to load the page and then save it like:
const html = await page.content();
// ... write to file
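For reference, a minimal sketch of that write-to-file step (using fs/promises; 'page.html' is just an example filename):

const fs = require('fs').promises;
// write the serialized DOM to disk
await fs.writeFile('page.html', html);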
Ok, that works. Now I am going to scroll through infinite-scrolling pages like Twitter. So I decided to block all images in the Puppeteer page:
page.on('request', request => {
  if (request.resourceType() === 'image') {
    const imgUrl = request.url();
    download(imgUrl, 'download').then((output) => {
      images.push({url: output.url, filename: output.filename});
    }).catch((err) => {
      console.log(err);
    });
    request.abort();
  } else {
    request.continue();
  }
});
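Note that request interception has to be enabled before this handler runs, otherwise request.abort() / request.continue() will throw; one line before registering the handler takes care of it:

await page.setRequestInterception(true);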
Ok, I now use the npm download lib to download all the images, and yes, the downloaded images are ok :D.
Now when I save the content, I want the source to point to the offline images.
const html = await page.content();
But now I'd like to replace all the
<img src="/pic.png?id=123">
<img src="https://twitter.com/pics/1.png">
And also things like:
<div style="background-image: url('this_also.gif')"></div>
So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?
JavaScript and CSS would also be nice.
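One rough way to handle at least the <img> case is to rewrite the tags in the live DOM before calling page.content(), assuming the images array built above maps each remote URL to a local filename (background-image URLs in CSS would need a similar pass over the stylesheets):

// rewrite <img> tags so they point at the downloaded copies;
// `images` is the array filled in the request handler above
await page.evaluate((images) => {
  document.querySelectorAll('img').forEach((img) => {
    const match = images.find((i) => i.url === img.src);
    if (match) img.setAttribute('src', match.filename);
  });
}, images);
const html = await page.content();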
Update
For now I will open the big HTML file again with Puppeteer, and then intercept all files such as: https://dom.com/img/img.jpg, /file.jpg, ....
request.respond({
  status: 200,
  contentType: 'image/jpeg',
  body: '..'
});
I could also do it with a Chrome extension, but I'd like to have a function with some options, e.g. page.html(), the same as page.pdf().
Let's go back to the first point: you can use fullPage to take the screenshot.
await page.screenshot({path: 'example.png', fullPage: true});
If you really want to download all the resources for offline use, yes you can:
const fse = require('fs-extra');

page.on('response', async (res) => {
  // save each response body to SOMEWHERE_TO_STORE
  await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
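SOMEWHERE_TO_STORE is a placeholder; one way to fill it in is to map each URL to a local file and remember its Content-Type for the replay step later. A sketch (toLocalPath and manifest are my own helper names, not Puppeteer APIs):

const path = require('path');
const crypto = require('crypto');

const manifest = {}; // url -> { file, type }

// derive a flat filename under ./offline from the URL (illustrative only)
const toLocalPath = (url) =>
  path.join('offline', crypto.createHash('md5').update(url).digest('hex'));

page.on('response', async (res) => {
  const url = res.url();
  const file = toLocalPath(url);
  manifest[url] = { file, type: res.headers()['content-type'] };
  await fse.outputFile(file, await res.buffer());
});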
Then you can browse the website offline through Puppeteer with everything working.
await page.setRequestInterception(true);

page.on('request', async (req) => {
  // handle the request by responding with the data you stored in SOMEWHERE_TO_STORE
  // and of course, don't forget THE_FILE_TYPE
  req.respond({
    status: 200,
    contentType: THE_FILE_TYPE,
    body: await fse.readFile(SOMEWHERE_TO_STORE),
  });
});
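If you went with the manifest sketch from above, the replay handler could look roughly like this (again just a sketch, not the only way to do it):

page.on('request', async (req) => {
  const entry = manifest[req.url()];
  if (entry) {
    req.respond({
      status: 200,
      contentType: entry.type,
      body: await fse.readFile(entry.file),
    });
  } else {
    req.abort(); // nothing was stored for this URL, so fail it while offline
  }
});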