
Retrieving JavaScript Rendered HTML with Puppeteer

I am attempting to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page rather than the HTML of an incorrect gene page such as https://www.ncbi.nlm.nih.gov/gene/119016.

URL fragments are not passed to the server. Instead, the page's JavaScript uses them client-side to (in this case) create entirely different HTML. That is the HTML you get when you open the page in a browser and choose "View page source", and it is the HTML I want to retrieve. R's readLines() ignores everything in the URL after the #.
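To illustrate why the server never sees #see-all, here is a small sketch using Node's built-in URL class. The variable names are illustrative only. It splits the address from the question into the part that goes into the HTTP request and the fragment that stays in the browser.

```javascript
// Split the question's URL into what is sent to the server and what stays client-side.
const target = new URL('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

// Path plus query string: this is what ends up in the HTTP request.
const sentToServer = target.pathname + target.search;

// The fragment is only visible to scripts running in the page (via location.hash).
const keptInBrowser = target.hash;

console.log(sentToServer);  // /gene/?term=AGAP8
console.log(keptInBrowser); // #see-all
```

Any tool that only performs the HTTP request therefore receives the same response with or without the fragment, which is why a real browser engine is needed here.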

I tried PhantomJS first, but it just returned the error described here: "ReferenceError: Can't find variable: Map". This seems to be caused by PhantomJS not supporting a JavaScript feature (Map) that NCBI uses, which rules out this route.

I had more success with Puppeteer, using the following JavaScript run with Node.js:

const puppeteer = require('puppeteer');
(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
  var HTML = await page.content()
  const fs = require('fs');
  var ws = fs.createWriteStream(
    'TempInterfaceWithChrome.js'
  );
  ws.write(HTML);
  ws.end();
  var ws2 = fs.createWriteStream(
    'finishedFlag'
  );
  ws2.end();
  browser.close();
})();

However, this returned what appeared to be the pre-rendered HTML. How do I (programmatically) get the final HTML that I see in the browser?

Sir_Zorg asked Aug 24 '17 21:08




3 Answers

You can try to change this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

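into this (from Puppeteer 1.0 onward the 'networkidle' value is split into 'networkidle0' and 'networkidle2'):

  await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle0'});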

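Or, you can create a function listenFor() that listens for a custom event on page load. The page-side handler calls window.onCustomEvent, which has to be exposed from Node first, and both calls must run before page.goto():

// Make a Node-side callback available to the page as window.onCustomEvent.
await page.exposeFunction('onCustomEvent', e => {
  console.log(`${e.type} fired`, e.detail || '');
});

function listenFor(type) {
  // Register the listener before any of the page's own scripts run.
  return page.evaluateOnNewDocument(type => {
    document.addEventListener(type, e => {
      window.onCustomEvent({type, detail: e.detail});
    });
  }, type);
}

await listenFor('custom-event-ready'); // Listen for the "custom-event-ready" custom event on page load.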

Later edit:

This also might come in handy:

await page.waitForSelector('h3'); // replace h3 with your selector
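The two waiting strategies above can be combined. The following is a minimal sketch, not a tested recipe for the NCBI page: getRenderedHtml is a hypothetical helper name, and the selector is a placeholder for some element that you know only exists after the client-side scripts have run.

```javascript
// Sketch: load a URL, wait for the network to go quiet and for a selector to
// appear, then return the current DOM serialized as HTML.
// `page` is a Puppeteer Page; `selector` is a placeholder (assumption) for an
// element that is only present once the page's scripts have rendered it.
async function getRenderedHtml(page, url, selector) {
  // 'networkidle0' resolves once there have been no network connections for 500 ms.
  await page.goto(url, { waitUntil: 'networkidle0' });
  // Rejects with a timeout error if the element never appears.
  await page.waitForSelector(selector);
  // HTML of the live DOM, including nodes created by scripts.
  return page.content();
}
```

With Puppeteer installed, this would be called inside an async function with a page obtained from browser.newPage(), the URL from the question, and a selector such as 'h3', followed by await browser.close().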
Carol-Theodor Pelu answered Oct 06 '22 01:10


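Maybe try waiting for navigation to finish. Note that waitForNavigation() takes an options object rather than a number, and that it only resolves if the page actually navigates again after goto(); otherwise it times out:

await page.waitForNavigation({waitUntil: 'networkidle0'});

and after that:

let html = await page.content();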
Evgeniy Grabelsky answered Oct 05 '22 23:10


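I had success using the following to get HTML content that was generated after the page had loaded.

const puppeteer = require('puppeteer');
(async () => {
  const url = 'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all'; // URL from the question
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Fixed 2 s pause for client-side scripts (page.waitFor() no longer exists in recent Puppeteer versions).
    await new Promise(resolve => setTimeout(resolve, 2000));
    // Read the rendered markup of the first element matching the selector.
    const html_content = await page.$eval('.element-class-name', el => el.innerHTML);
    console.log(html_content);
  } catch (err) {
    console.log(err);
  } finally {
    await browser.close(); // always release the browser process
  }
})();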

Hope this helps.
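A fixed delay is a guess about how long rendering takes. One possible alternative, sketched here under assumptions, is to poll the page until its HTML stops changing. This is a different technique from the fixed pause above; waitForStableHtml, intervalMs and maxChecks are hypothetical names and defaults, and pages that keep updating themselves (timers, ads) may never settle.

```javascript
// Sketch: poll page.content() until two consecutive reads are identical.
// `page` is a Puppeteer Page; intervalMs and maxChecks are illustrative defaults.
async function waitForStableHtml(page, intervalMs = 500, maxChecks = 20) {
  let previous = await page.content();
  for (let i = 0; i < maxChecks; i++) {
    // Pause between reads so client-side scripts get time to run.
    await new Promise(resolve => setTimeout(resolve, intervalMs));
    const current = await page.content();
    if (current === previous) {
      return current; // unchanged between two reads, treat as settled
    }
    previous = current;
  }
  throw new Error('HTML did not settle within ' + maxChecks + ' checks');
}
```

It would be called after page.goto(url) in place of the fixed pause, and its return value used instead of a separate page.content() call.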

Darren Hall answered Oct 06 '22 01:10