
How to manipulate the DOM before in-page scripts are executed?

Using Puppeteer, how can I run a script in the page context, with the full DOM available, before the in-page JS is executed?

For example, how can I run the following script to remove alt attributes from img elements, before any of the page JS is run?

document.querySelectorAll('img[alt]').forEach(
  e => e.removeAttribute('alt')
)

(page.evaluateOnNewDocument looks like it would be useful, but it appears to run before the page content is available: at the point at which it runs, the page is blank.)

mjs asked Feb 02 '18 06:02



1 Answer

I think the way to achieve this is to load the page with JavaScript disabled, rebuild it without its scripts, and run those scripts only after your own code:

  1. Disable JavaScript with page.setJavaScriptEnabled(false).
  2. Navigate to the page.
  3. Extract all the scripts, and the HTML without the scripts.
  4. Re-enable JavaScript with page.setJavaScriptEnabled(true).
  5. Navigate to the HTML from step 3 with page.goto(`data:text/html,${HTMLWithoutScript}`).
  6. Execute your own script.
  7. Inject the original scripts extracted in step 3 with page.addScriptTag({ content: script }).
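The seven steps above can be sketched as one reusable function. This is only a sketch under assumptions: runBeforePageScripts and mutateDom are hypothetical names, page is assumed to be a Puppeteer Page, and script attributes such as type or async are ignored. Because the page is reloaded from a data: URL, its origin changes and relative URLs in the markup may no longer resolve.

```javascript
// Sketch only: `runBeforePageScripts` and `mutateDom` are hypothetical names,
// and `page` is assumed to be a Puppeteer Page object.
async function runBeforePageScripts(page, url, mutateDom) {
    // Steps 1-2: load the page with JavaScript disabled so no page script runs.
    await page.setJavaScriptEnabled(false);
    await page.goto(url);

    // Step 3: record every script in document order, then strip them from the DOM.
    const { scripts, html } = await page.evaluate(() => {
        const nodes = Array.from(document.querySelectorAll('script'));
        const scripts = nodes.map(node =>
            node.src ? { url: node.src } : { content: node.textContent }
        );
        nodes.forEach(node => node.remove());
        return { scripts, html: document.documentElement.outerHTML };
    });

    // Steps 4-5: re-enable JavaScript and load the script-free HTML.
    // encodeURIComponent keeps characters such as "#" from truncating the data: URL.
    await page.setJavaScriptEnabled(true);
    await page.goto(`data:text/html,${encodeURIComponent(html)}`);

    // Step 6: run the caller's DOM changes inside the page.
    await page.evaluate(mutateDom);

    // Step 7: re-inject the original scripts one at a time, in their original order.
    for (const script of scripts) {
        await page.addScriptTag(script);
    }
}
```

With a real page, the question's snippet would be passed as the third argument, wrapped in a function: () => document.querySelectorAll('img[alt]').forEach(e => e.removeAttribute('alt')).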

Example

Here is a reproduction of the problem you describe:

const puppeteer = require('puppeteer');

const html = `
<html>
    <head></head>
    <body>
        <img src="https://picsum.photos/200/300?image=1062" alt="dog ">
        <img src="https://picsum.photos/200/300?image=1072" alt="car ">
        <div class="alts">List of alts: </div>
        <script>
            const images = document.querySelectorAll('img');
            const altsContainer = document.querySelector('.alts');
            images.forEach(image => {
                const alt = image.getAttribute('alt') || 'missing alt ';
                altsContainer.insertAdjacentHTML('beforeend', alt);
            })
        </script>
    </body>
</html>`;

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(`data:text/html,${html}`);
    await page.evaluate(() => {
        document.querySelectorAll('img[alt]').forEach(
            e => e.removeAttribute('alt')
        )
    });
    await page.screenshot({ path: 'image.png' });
    await browser.close();
})();

This code produces:

[image: broken example]

So removing the alts does not work here: the page script has already read them by the time page.evaluate runs.

Solution

// Reuses `puppeteer` and `html` from the example above.
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Load the page with JavaScript disabled so the inline script does not run.
    await page.setJavaScriptEnabled(false);
    await page.goto(`data:text/html,${html}`);
    // Save the script source, blank it out, and keep the remaining markup.
    const { script, HTMLWithoutScript } = await page.evaluate(() => {
        const script = document.querySelector('script').innerHTML;
        document.querySelector('script').innerHTML = '';
        const HTMLWithoutScript = document.body.innerHTML;
        return { script, HTMLWithoutScript };
    });

    // Reload the script-free markup with JavaScript enabled.
    await page.setJavaScriptEnabled(true);
    await page.goto(`data:text/html,${HTMLWithoutScript}`);
    // Remove the alts first, and only then run the original script.
    await page.evaluate(() => {
        document.querySelectorAll('img[alt]').forEach(
            e => e.removeAttribute('alt')
        );
    });
    await page.addScriptTag({ content: script });
    await page.screenshot({ path: 'image.png' });
    await browser.close();
})();

This produces the result you expect in the question:

[image: working example]
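One caveat with the page.goto(`data:text/html,...`) calls above: they interpolate raw HTML into a URL, so a "#" in the markup (for example in a CSS color or an anchor link) starts the URL fragment and cuts the document short. A minimal sketch of a safer way to build the URL follows. The helper name toDataUrl is hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): percent-encode the HTML so that
// characters with a special meaning in URLs, such as "#" and "%", survive intact.
function toDataUrl(html) {
    return `data:text/html;charset=utf-8,${encodeURIComponent(html)}`;
}

const sample = '<p style="color:#c00">red</p>';

// Unencoded: everything from "#" onward is parsed as the fragment.
const naive = new URL(`data:text/html,${sample}`);
console.log(naive.hash !== ''); // true

// Encoded: the whole document stays in the URL path.
const encoded = new URL(toDataUrl(sample));
console.log(encoded.hash === ''); // true
```

The result can be passed to page.goto in place of the template string, for example page.goto(toDataUrl(html)).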

Everettss answered Oct 22 '22 06:10