 

Can XPath expressions access shadow-root elements?

I am currently scraping news article sites. While extracting their main content, I ran into the issue that many of them contain embedded tweets like this one:

(screenshot of a tweet embedded in an article)

I use XPath expressions with XPath Helper (a Chrome extension) to test whether I can extract the content, then add the expression to my Scrapy (Python) spider. However, elements inside a #shadow-root seem to be outside the scope of the regular DOM. I am looking for a way to get the content inside these kinds of elements, preferably with XPath.

asked Apr 10 '18 by Necronet

People also ask

Does Selenium support shadow DOM?

To access shadow DOM elements in Selenium 4 with Chromium-based browsers (Google Chrome and Microsoft Edge) version 96 or greater, use the new shadow root method, which is available in the Java, Ruby and Python bindings.
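The JavaScript bindings (selenium-webdriver 4) expose the same method as `getShadowRoot()`. A minimal sketch, assuming Chrome >= 96; the URL and the `#tweet-host` selector are placeholders for whatever element hosts the shadow root on the page being scraped:

```javascript
// Sketch: reading text out of a shadow root with Selenium 4.
// Note that only CSS selectors work inside the shadow root, not XPath.
const { Builder, By } = require('selenium-webdriver');

(async () => {
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://example.com/article');
        // The element that hosts the shadow tree (hypothetical selector).
        const host = await driver.findElement(By.css('#tweet-host'));
        const shadow = await host.getShadowRoot();
        // Query inside the shadow tree with a CSS selector.
        const inner = await shadow.findElement(By.css('.tweet-text'));
        console.log(await inner.getText());
    } finally {
        await driver.quit();
    }
})();
```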


2 Answers

Most web scrapers, including Scrapy, don't support the Shadow DOM, so you will not be able to access elements in shadow trees at all.

And even if a web scraper did support the Shadow DOM, XPath would not help: XPath is not supported in shadow trees at all. Only CSS selectors are supported, and only to some extent, as documented in the CSS Scoping spec.

answered Oct 18 '22 by BoltClock

One way to scrape pages containing shadow DOM with tools that don't support the shadow DOM API is to recursively iterate over the shadow roots and replace each one with its HTML code:

// Returns HTML of given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
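As a rough sanity check, the flattening logic can be traced outside a browser with minimal stand-ins for the DOM objects involved. The mocks below are my own illustration, not part of the original script, and implement only the properties the two functions touch (`childNodes`, `shadowRoot`, `querySelectorAll`, `outerHTML`, `innerHTML`):

```javascript
// The two functions from the answer above, unchanged.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

const replaceShadowDomsWithHtml = (rootElement) => {
    for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

// Mock: a host element whose shadow root contains a single <p>.
const shadowChild = { nodeValue: null, outerHTML: '<p>Embedded tweet text</p>' };
const host = {
    innerHTML: '',
    shadowRoot: {
        childNodes: [shadowChild],
        querySelectorAll: () => [],   // no nested shadow hosts
    },
};
const body = { querySelectorAll: () => [host] };

replaceShadowDomsWithHtml(body);
console.log(host.innerHTML); // '<p>Embedded tweet text</p>'
```

After the call, the shadow content has been copied into the host's `innerHTML`, which is what makes it visible to ordinary scrapers.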

If you are scraping with a full browser (Chrome with Puppeteer, PhantomJS, etc.), just inject this script into the page. It is important to execute it only after the whole page has rendered, because it can break the JS code of the shadow DOM components.
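With Puppeteer the injection could look like the following sketch. It assumes the two functions are saved in a file named `flattenShadowDoms.js` (a hypothetical name) as a classic script, so `replaceShadowDomsWithHtml` is reachable from a later `evaluate` call; the URL is a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until the page (and its shadow DOM components) has rendered,
    // since the flattening can break the components' own JS.
    await page.goto('https://example.com/article', { waitUntil: 'networkidle0' });
    await page.addScriptTag({ path: 'flattenShadowDoms.js' });
    await page.evaluate(() => replaceShadowDomsWithHtml(document.body));
    // The shadow content is now part of the regular HTML.
    const html = await page.content();
    console.log(html.length);
    await browser.close();
})();
```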

Check out the full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom

answered Oct 18 '22 by Marek Trunkát