 

Can XPath expressions access shadow-root elements?

I am currently scraping news article sites. While extracting their main content, I ran into the issue that many of them contain embedded tweets like this one:

(screenshot of a tweet embedded in an article)

I use XPath expressions with XPath Helper (a Chrome extension) to test whether I can extract the content, then add the expression to my Scrapy (Python) spider. However, elements inside a #shadow-root seem to be outside the scope of the regular DOM. I am looking for a way to get the content inside these kinds of elements, preferably with XPath.

asked Apr 10 '18 by Necronet

People also ask

Does Selenium support shadow DOM?

To access shadow DOM elements in Selenium 4 with Chromium-based browsers (Google Chrome and Microsoft Edge) version 96 or greater, use the new shadow root method, which is available in the Java, Ruby and Python bindings.
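The JavaScript bindings (selenium-webdriver 4) expose the same method as `getShadowRoot()`. A minimal sketch, assuming Chrome >= 96; the URL and the `#tweet-host` selector are placeholders for whatever element hosts the shadow root on the page being scraped:

```javascript
// Sketch: reading text out of a shadow root with Selenium 4.
// Note that only CSS selectors work inside the shadow root, not XPath.
const { Builder, By } = require('selenium-webdriver');

(async () => {
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://example.com/article');
        // The element that hosts the shadow tree (hypothetical selector).
        const host = await driver.findElement(By.css('#tweet-host'));
        const shadow = await host.getShadowRoot();
        // Query inside the shadow tree with a CSS selector.
        const inner = await shadow.findElement(By.css('.tweet-text'));
        console.log(await inner.getText());
    } finally {
        await driver.quit();
    }
})();
```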


2 Answers

Most web scrapers, including Scrapy, don't support the Shadow DOM, so you will not be able to access elements in shadow trees at all.

And even if a web scraper did support the Shadow DOM, XPath would not help: XPath is not supported in shadow trees at all. Only CSS selectors are supported, and only to some extent, as documented in the CSS Scoping spec.

answered Oct 18 '22 by BoltClock

One way to scrape pages containing shadow DOM with tools that don't support the shadow DOM API is to recursively iterate over the shadow roots and replace each one with its HTML code:

// Returns HTML of given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
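As a rough sanity check, the flattening logic can be traced outside a browser with minimal stand-ins for the DOM objects involved. The mocks below are my own illustration, not part of the original script, and implement only the properties the two functions touch (`childNodes`, `shadowRoot`, `querySelectorAll`, `outerHTML`, `innerHTML`):

```javascript
// The two functions from the answer above, unchanged.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

const replaceShadowDomsWithHtml = (rootElement) => {
    for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

// Mock: a host element whose shadow root contains a single <p>.
const shadowChild = { nodeValue: null, outerHTML: '<p>Embedded tweet text</p>' };
const host = {
    innerHTML: '',
    shadowRoot: {
        childNodes: [shadowChild],
        querySelectorAll: () => [],   // no nested shadow hosts
    },
};
const body = { querySelectorAll: () => [host] };

replaceShadowDomsWithHtml(body);
console.log(host.innerHTML); // '<p>Embedded tweet text</p>'
```

After the call, the shadow content has been copied into the host's `innerHTML`, which is what makes it visible to ordinary scrapers.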

If you are scraping with a full browser (Chrome with Puppeteer, PhantomJS, etc.), just inject this script into the page. It is important to execute it only after the whole page has rendered, because it can break the JS code of the shadow DOM components.
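With Puppeteer the injection could look like the following sketch. It assumes the two functions are saved in a file named `flattenShadowDoms.js` (a hypothetical name) as a classic script, so `replaceShadowDomsWithHtml` is reachable from a later `evaluate` call; the URL is a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Wait until the page (and its shadow DOM components) has rendered,
    // since the flattening can break the components' own JS.
    await page.goto('https://example.com/article', { waitUntil: 'networkidle0' });
    await page.addScriptTag({ path: 'flattenShadowDoms.js' });
    await page.evaluate(() => replaceShadowDomsWithHtml(document.body));
    // The shadow content is now part of the regular HTML.
    const html = await page.content();
    console.log(html.length);
    await browser.close();
})();
```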

Check out the full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom

answered Oct 18 '22 by Marek Trunkát