
Puppeteer, save webpage and images

I'm trying to save a webpage for offline use with Node.js and Puppeteer. I see a lot of examples with:

await page.screenshot({path: 'example.png'});

But with a bigger webpage that's not an option. A better option in Puppeteer is to load the page and then save its content:

const fs = require('fs/promises');

const html = await page.content();      // serialized DOM of the current page
await fs.writeFile('page.html', html);  // file name is just an example

Ok, that works. Now I want to handle pages with infinite scrolling, like Twitter. So I decided to block all images in the Puppeteer page:

// request interception must be enabled before requests can be aborted or continued
await page.setRequestInterception(true);

const images = []; // collected {url, filename} pairs

page.on('request', request => {
    if (request.resourceType() === 'image') {
        const imgUrl = request.url()
        // fetch the image ourselves with the 'download' lib, then skip it in the page
        download(imgUrl, 'download').then((output) => {
            images.push({url: output.url, filename: output.filename})
        }).catch((err) => {
            console.log(err)
        })
        request.abort()
    } else {
        request.continue()
    }
})

Ok, I used the npm 'download' lib to fetch all the images, and yes, the downloaded images are fine :D.

Now, when I save the content, I want the source to point to the offline images.

const html = await page.content();

But now I'd like to replace all the

<img src="/pic.png?id=123"> 
<img src="https://twitter.com/pics/1.png">

And also things like:

<div style="background-image: url('this_also.gif')"></div>
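
One way to do the replacement (a sketch, assuming the images array collected above holds {url, filename} pairs): a plain string substitution over the saved HTML, which also catches URLs inside inline style attributes. Note that request.url() is always absolute, so relative src values like /pic.png may need resolving against the page URL before they match.

// sketch: swap every downloaded URL for its local file name
let offlineHtml = html;
for (const img of images) {
    offlineHtml = offlineHtml.split(img.url).join(img.filename);
}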

So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?

JavaScript and CSS would also be nice.

Update

For now I will open the big HTML file again with Puppeteer and intercept every file request (https://dom.com/img/img.jpg, /file.jpg, ...) so I can answer it locally:

request.respond({
    status: 200,
    contentType: 'image/jpeg',
    body: '..'
});
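
A slightly fuller sketch of that handler (the localPathFor helper and the mime table are my assumptions, not Puppeteer API):

const path = require('path');
const fs = require('fs/promises');

// hypothetical helper: map a request URL to the file saved earlier
const localPathFor = (url) => path.join('download', new URL(url).pathname);

const mimeTypes = {
    '.jpg': 'image/jpeg',
    '.png': 'image/png',
    '.gif': 'image/gif',
    '.css': 'text/css',
    '.js': 'application/javascript',
};

await page.setRequestInterception(true);
page.on('request', async (request) => {
    const file = localPathFor(request.url()); // URL.pathname drops any query string
    let body;
    try {
        body = await fs.readFile(file);
    } catch (err) {
        return request.continue(); // nothing saved under that path
    }
    await request.respond({
        status: 200,
        contentType: mimeTypes[path.extname(file)] || 'application/octet-stream',
        body,
    });
});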

I could also do this with a Chrome extension, but I'd like to have a function with some options, e.g. page.html(), the same as page.pdf().

asked Dec 05 '18 by Johan Hoeksma




1 Answer

Going back to the beginning: you can use fullPage to take a screenshot of the entire page.

await page.screenshot({path: 'example.png', fullPage: true});

If you really want to download all the resources for offline use, yes you can:

const fse = require('fs-extra');

// the handler must be async so that await is legal inside it
page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
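
For example, SOMEWHERE_TO_STORE could be derived from the response URL (a sketch; the offline directory layout is my assumption, and res.buffer() can throw for redirect responses, hence the try/catch):

const path = require('path');

page.on('response', async (res) => {
    const { pathname } = new URL(res.url());
    const target = path.join('offline', pathname === '/' ? 'index.html' : pathname);
    try {
        await fse.outputFile(target, await res.buffer()); // fs-extra creates parent dirs
    } catch (err) {
        console.log(err); // e.g. redirects have no buffered body
    }
});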

Then you can browse the website offline through Puppeteer with everything working.

await page.setRequestInterception(true);
page.on('request', async (req) => {
    // handle the request by responding with the data stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    await req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
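
THE_FILE_TYPE is easiest to capture while saving, since the response headers already carry it (a sketch; keeping the map in memory is my assumption, a real run would persist it alongside the files):

const contentTypes = {};

page.on('response', async (res) => {
    // remember the content type per URL while saving the body as above
    contentTypes[res.url()] = res.headers()['content-type'];
});

// later, during replay:
// contentType: contentTypes[req.url()]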

answered Oct 21 '22 by ayiis