Currently I am scraping article news sites, in the process of getting its main content, I ran into the issue that a lot of them have embedded tweets in them like these:
I use XPath expressions with XPath helper(chrome addon) in order to test if I can get content, then add this expression to scrapy python, but with elements that are inside a #shadow-root
elements seem to be outside the scope of the DOM, I am looking for a way to get content inside these types of elements, preferably with XPath.
To access Shadow DOM elements in Selenium 4 with Chromium browsers (Microsoft Edge and Google Chrome) version 96 or greater, use the new shadow root method: java. ruby. python.
Most web scrapers, including Scrapy, don't support the Shadow DOM, so you will not be able to access elements in shadow trees at all.
And even if a web scraper did support the Shadow DOM, XPath is not supported at all. Only selectors are supported to some extent, as documented in the CSS Scoping spec.
One way to scrape pages containing shadow DOMs with tools that don't work with shadow DOM API is to recursively iterate over shadow DOM elements and replace them with their HTML code:
// Returns HTML of given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
let shadowHTML = '';
for (let el of shadowRoot.childNodes) {
shadowHTML += el.nodeValue || el.outerHTML;
}
return shadowHTML;
};
// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
for (let el of rootElement.querySelectorAll('*')) {
if (el.shadowRoot) {
replaceShadowDomsWithHtml(el.shadowRoot)
el.innerHTML += getShadowDomHtml(el.shadowRoot);
}
}
};
replaceShadowDomsWithHtml(document.body);
If you are scraping using a full browser (Chrome with Puppeteer, PhantomJS, etc.) then just inject this script to the page. Important is to execute this after the whole page is rendered because it possibly breaks the JS code of shadow DOM components.
Check full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With