
Scrape Text From Iframe

How would I scrape text from an iframe with Puppeteer?

As a simple reproducible example, scrape the text "This is a paragraph" from the iframe at this URL:

https://www.w3schools.com/js/tryit.asp?filename=tryjs_events

Alex asked Nov 26 '17 22:11




2 Answers

To scrape an iframe's text in Puppeteer, you can use page.evaluate to run JavaScript in the context of the page and return the iframe's contents.

The steps to do so are:

  1. Grab the iframe element.
  2. Get the iframe's document object.
  3. Use the document object to read the iframe's HTML.

I wrote this program, which grabs "This is a paragraph" from the link you provided:

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.w3schools.com/js/tryit.asp?filename=tryjs_events');

    // The callback runs inside the browser, so `document` is the parent page's document
    const iframeParagraph = await page.evaluate(() => {
        // 1. grab the iframe element
        const iframe = document.getElementById("iframeResult");

        // 2. grab iframe's document object
        const iframeDoc = iframe.contentDocument || iframe.contentWindow.document;

        // 3. read the HTML of the target element inside the iframe
        const iframeP = iframeDoc.getElementById("demo");
        return iframeP.innerHTML;
    });

    console.log(iframeParagraph); // prints "This is a paragraph"

    await browser.close();
})();
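
As a possible alternative to reading contentDocument inside page.evaluate, here is a minimal sketch that goes through Puppeteer's own Frame API instead. It is not part of the program above. It assumes a reasonably recent Puppeteer version in which ElementHandle.contentFrame() is available, and it reuses the "#iframeResult" and "#demo" ids from that program. The names readFrameText and main are hypothetical helpers chosen for illustration. One reason to consider this route is that contentDocument is null for cross-origin iframes, whereas a Frame object can still be queried.

```javascript
// Sketch only: read text from inside an iframe via Puppeteer's Frame API.
// `readFrameText` is a hypothetical helper name, not a Puppeteer function.
async function readFrameText(page, iframeSelector, innerSelector) {
    // Wait until the <iframe> element exists in the parent page.
    const iframeHandle = await page.waitForSelector(iframeSelector);

    // Resolve the Frame hosted by that <iframe> element.
    const frame = await iframeHandle.contentFrame();
    if (!frame) {
        throw new Error(`No frame attached to ${iframeSelector}`);
    }

    // Wait for the target element inside the frame, then read its text.
    await frame.waitForSelector(innerSelector);
    return frame.$eval(innerSelector, (el) => el.textContent.trim());
}

// Hypothetical entry point: launches a browser and prints the frame text.
async function main(url) {
    // Required here so the helper above can be used without launching a browser.
    const puppeteer = require("puppeteer");
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        console.log(await readFrameText(page, "#iframeResult", "#demo"));
    } finally {
        // Close the browser even if the lookup above throws.
        await browser.close();
    }
}
```

To try it, call main with the URL from the question. The helper takes the page as a parameter, so it can also be dropped into an existing script that already has a page open.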
Christian Santos answered Nov 06 '22 10:11


I know this question already has an answer, but in case someone wants another approach: you can grab the content of the iframe and use cheerio to traverse the elements and get the text of any element you want. You can find it here.

Gregor Ojstersek answered Nov 06 '22 09:11