
How to dump more than <body> on chrome / chromium headless?

Chrome's documentation states:

The --dump-dom flag prints document.body.innerHTML to stdout:

As per the title, how can more of the DOM object (ideally all) be dumped with Chromium headless? I can manually save the entire DOM via the developer tools, but I want a programmatic solution.

Jesse Meyer asked Jun 30 '17 17:06




1 Answer

Update 2019-04-23: Google has been very active on the headless front and many updates have landed.

The answer below applies to v62; the current version is v73, and Chrome keeps updating. https://www.chromestatus.com/features/schedule

I highly recommend checking out puppeteer for any future development with headless Chrome. It is maintained by Google, and the npm package installs the required Chrome version along with it. You just use the puppeteer API as documented and do not have to worry about Chrome versions or about setting up the connection between headless Chrome and the DevTools API, which is what does 99% of the magic.

  • Repo: https://github.com/GoogleChrome/puppeteer
  • Docs: https://pptr.dev/
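
As a rough illustration of the puppeteer route, here is a minimal sketch that returns the full serialized page rather than only the body. It assumes puppeteer has been installed with `npm install puppeteer`; the function name `dumpFullHtml` and the choice of `networkidle0` as the wait condition are my own assumptions, not something the docs prescribe for this case. The puppeteer module is passed in as an argument so the function can be exercised with a stub.

```javascript
// Sketch: fetch the full HTML (including <head> and <title>) with puppeteer.
// `puppeteer` is expected to be the object returned by require('puppeteer').
async function dumpFullHtml(puppeteer, url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: waiting for the network to go idle gives JS-driven pages
    // (e.g. SPAs) time to set their title and render their content.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // page.content() returns the full HTML of the page, doctype included.
    return await page.content();
  } finally {
    // Always close the browser so no headless process is left behind.
    await browser.close();
  }
}
```

Usage would look something like `dumpFullHtml(require('puppeteer'), 'https://www.chromestatus.com/').then(console.log);`.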

Update 2017-10-29: Chrome already has a --dump-html flag, which returns the full HTML, not only the body.

v62 has it, and v62 is already on the stable channel.

Issue that fixed this: https://bugs.chromium.org/p/chromium/issues/detail?id=752747

Current Chrome status (version per channel): https://www.chromestatus.com/features/schedule
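
For a programmatic version of the command-line approach, a small Node.js wrapper along these lines may be enough. This is only a sketch: the binary name `google-chrome` is an assumption (it may be `chromium`, `chromium-browser` or a full path on your system), the helper names are made up for illustration, and the dump flag is left as a parameter that defaults to `--dump-dom` so you can pass whichever flag your Chrome build actually accepts.

```javascript
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// Build the argument list for a one-shot headless dump of `url`.
function buildDumpArgs(url, dumpFlag = '--dump-dom') {
  return ['--headless', dumpFlag, url];
}

// Run Chrome once and resolve with whatever it printed to stdout.
// `chromeBinary` is an assumed name; adjust it for your installation.
async function dumpWithCli(url, chromeBinary = 'google-chrome', dumpFlag = '--dump-dom') {
  const { stdout } = await execFileAsync(chromeBinary, buildDumpArgs(url, dumpFlag), {
    maxBuffer: 64 * 1024 * 1024, // large pages can exceed the default buffer
  });
  return stdout;
}
```

Calling `dumpWithCli('https://www.chromestatus.com/').then(console.log)` should print the dumped markup, provided the binary is on your PATH.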

Leaving the old answer below for reference:

You can do it with the Chrome remote interface (chrome-remote-interface). I tried it and wasted a couple of hours trying to launch Chrome and get the full HTML, including the title, and I would say it is just not ready yet.

It works sometimes, but when I ran it in a production environment I got errors from time to time: all kinds of random errors, such as connection resets and no Chrome process found to kill. These errors came up intermittently and are hard to debug.

I personally use --dump-dom to get the HTML when I need the body, and when I need the title I just use curl for now. Of course, Chrome can give you the title of an SPA, which curl alone cannot do if the title is set from JS. I will switch to Chrome for this once there is a stable solution.
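
For the curl half of that workaround, the title still has to be pulled out of the raw HTML. A minimal sketch of that step is below; `extractTitle` is a hypothetical helper, and a regular expression is a shortcut that works for simple static pages, not a real HTML parser.

```javascript
// Sketch: extract the text of the first <title> element from raw HTML,
// e.g. the output of `curl -s <url>`. Returns null when there is no title.
function extractTitle(html) {
  const match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}
```

As noted above, this only sees the title present in the served HTML; a title set later from JS will not show up.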

I would love to have a --dump-html flag in Chrome that simply outputs all the HTML. If a Google engineer is reading this, please add such a flag to Chrome.

I've created an issue on the Chrome issue tracker; please click the "star" to favorite it so that Google developers notice it:

https://bugs.chromium.org/p/chromium/issues/detail?id=752747

Here is a long list of all kinds of Chrome flags (I'm not sure whether it is complete): https://peter.sh/experiments/chromium-command-line-switches/ There is nothing in it to dump the title tag.

This code is from Google's blog post; you can try your luck with it:

const CDP = require('chrome-remote-interface');

...

(async function() {

  const chrome = await launchChrome();
  const protocol = await CDP({port: chrome.port});

  // Extract the DevTools protocol domains we need and enable them.
  // See API docs: https://chromedevtools.github.io/devtools-protocol/
  const {Page, Runtime} = protocol;
  await Promise.all([Page.enable(), Runtime.enable()]);

  Page.navigate({url: 'https://www.chromestatus.com/'});

  // Wait for window.onload before doing stuff.
  Page.loadEventFired(async () => {
    const js = "document.querySelector('title').textContent";
    // Evaluate the JS expression in the page.
    const result = await Runtime.evaluate({expression: js});

    console.log('Title of page: ' + result.result.value);

    protocol.close();
    chrome.kill(); // Kill Chrome.
  });

})();

Source: https://developers.google.com/web/updates/2017/04/headless-chrome
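
The same approach can be adapted to return the whole document instead of just the title, by evaluating `document.documentElement.outerHTML` in the page. The sketch below is my adaptation, not code from the blog post: `dumpOuterHtml` is a hypothetical name, and `protocol` is assumed to be an already connected chrome-remote-interface client like the one created above.

```javascript
// Sketch: get the serialized <html> element through the DevTools protocol.
// `protocol` is assumed to be the client returned by CDP({port: ...}).
async function dumpOuterHtml(protocol, url) {
  const { Page, Runtime } = protocol;
  await Promise.all([Page.enable(), Runtime.enable()]);

  // Subscribe to the load event before navigating so it cannot be missed.
  // Without a callback, the event method returns a Promise for the next event.
  const loaded = Page.loadEventFired();
  await Page.navigate({ url });
  await loaded;

  // Evaluate in the page; the string result comes back in result.value.
  const { result } = await Runtime.evaluate({
    expression: 'document.documentElement.outerHTML',
  });
  return result.value;
}
```

Note that `outerHTML` does not include the doctype, and the caller is still responsible for calling `protocol.close()` and killing Chrome afterwards.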

Lukas Liesis answered Sep 30 '22 01:09