Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Open Puppeteer with specific configuration (download PDF instead of PDF viewer)

I would like to open Chromium with a specific configuration.

I am looking for the configuration to activate the following option :

Settings => Site Settings => Permissions => PDF documents => "Download PDF files instead of automatically openning them in Chrome"

I searched the tags on this command line switch page but the only parameter that deals with pdf is --print-to-pdf which does not correspond to my need.

Do you have any ideas?

like image 909
Jeck Avatar asked May 22 '19 09:05

Jeck


People also ask

How do I change the open settings on a PDF?

Right-click the PDF, choose Open With > Choose default program or another app in. 2. Choose Adobe Acrobat Reader DC or Adobe Acrobat DC in the list of programs, and then do one of the following: (Windows 10) Select Always use this app to open .


2 Answers

There is no option you can pass into Puppeteer to force PDF downloads. However, you can use chrome-devtools-protocol to add a content-disposition: attachment response header to force downloads.

A visual flow of what you need to do:

cdp-modify-response-header (2)

I'll include a full example code below. In the example below, PDF files and XML files will be downloaded in headful mode.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null, 
  });

  const page = await browser.newPage();

  const client = await page.target().createCDPSession();

  await client.send('Fetch.enable', {
    patterns: [
      {
        urlPattern: '*',
        requestStage: 'Response',
      },
    ],
  });

  await client.on('Fetch.requestPaused', async (reqEvent) => {
    const { requestId } = reqEvent;

    let responseHeaders = reqEvent.responseHeaders || [];
    let contentType = '';

    for (let elements of responseHeaders) {
      if (elements.name.toLowerCase() === 'content-type') {
        contentType = elements.value;
      }
    }

    if (contentType.endsWith('pdf') || contentType.endsWith('xml')) {

      responseHeaders.push({
        name: 'content-disposition',
        value: 'attachment',
      });

      const responseObj = await client.send('Fetch.getResponseBody', {
        requestId,
      });

      await client.send('Fetch.fulfillRequest', {
        requestId,
        responseCode: 200,
        responseHeaders,
        body: responseObj.body,
      });
    } else {
      await client.send('Fetch.continueRequest', { requestId });
    }
  });

  await page.goto('https://pdf-xml-download-test.vercel.app/');

  await page.waitFor(100000);

  await client.send('Fetch.disable');

  await browser.close();
})();

For a more detailed explanation, please refer to the Git repo I've setup with comments. It also includes an example code for playwright.

like image 84
subwaymatch Avatar answered Nov 02 '22 23:11

subwaymatch


Puppeteer currently does not support navigating (or downloading) PDFs in headless mode that easily. Quote from the docs for the page.goto function:

NOTE Headless mode doesn't support navigation to a PDF document. See the upstream issue.

What you can do though, is detect if the browser is navigating to the PDF file and then download it yourself via Node.js.

Code sample

const puppeteer = require('puppeteer');
const http = require('http');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    page.on('request', req => {
        if (req.url() === '...') {
            const file = fs.createWriteStream('./file.pdf');
            http.get(req.url(), response => response.pipe(file));
        }
    });

    await page.goto('...');
    await browser.close();
})();

This navigates to a URL and monitors the ongoing requests. If the "matched request" is found, Node.js will manually download the file via http.get and pipe it into file.pdf. Please be aware that this is a minimal working example. You want to catch errors when downloading and might also want to use something more sophisticated then http.get depending on the situation.

Future note

In the future, there might be an easier way to do it. When puppeteer will support response interception, you will be able to simply force the browser to download a document, but right now this is not supported (May 2019).

like image 43
Thomas Dondorf Avatar answered Nov 02 '22 22:11

Thomas Dondorf