Retrieving JavaScript Rendered HTML with Puppeteer

I am attempting to scrape the HTML from this NCBI.gov page. I need to include the #see-all URL fragment so that I am guaranteed to get the search page rather than the HTML of an incorrect gene page, https://www.ncbi.nlm.nih.gov/gene/119016.

URL fragments are not passed to the server; instead, the page's client-side JavaScript uses them to (in this case) create entirely different HTML. That is what you get when you go to the page in a browser and choose "View page source", and it is the HTML I want to retrieve. R's readLines() ignores everything in the URL from the # onward.
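To make the fragment point concrete, here is a small sketch using the URL class built into Node.js. The helper name splitFragment is made up for illustration; it only separates the part of the address that goes into the HTTP request from the part that stays on the client.

```javascript
// splitFragment is a hypothetical helper, not part of any library.
// It separates what an HTTP client sends from the client-only fragment.
function splitFragment(rawUrl) {
  const url = new URL(rawUrl); // URL is a global in Node.js and in browsers
  return {
    // Path and query string: this is what ends up in the HTTP request.
    requestTarget: url.pathname + url.search,
    // Fragment: kept by the browser and read by page scripts via location.hash.
    fragment: url.hash,
  };
}

const parts = splitFragment('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
console.log(parts.requestTarget); // /gene/?term=AGAP8
console.log(parts.fragment);      // #see-all
```

Any tool that only performs the HTTP request therefore receives the same response with or without #see-all; the difference only appears once the page's scripts run.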

I tried PhantomJS first, but it just returned the error described here: "ReferenceError: Can't find variable: Map". This seems to result from PhantomJS not supporting some feature that NCBI uses, which rules out that route.

I had more success with Puppeteer, using the following JavaScript run with Node.js:

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(
    'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');
  // Serialize the page's current DOM to a string
  const HTML = await page.content();
  // Write the HTML to a file
  const ws = fs.createWriteStream(
    'TempInterfaceWithChrome.js'
  );
  ws.write(HTML);
  ws.end();
  // Create an empty flag file to signal that the script has finished
  const ws2 = fs.createWriteStream(
    'finishedFlag'
  );
  ws2.end();
  await browser.close();
})();

However, this returned what appeared to be the pre-rendered HTML. How do I programmatically get the final HTML that I see in the browser?

asked Aug 24 '17 21:08

Sir_Zorg

3 Answers

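You can try changing this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all');

into this:

await page.goto(
  'https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all', {waitUntil: 'networkidle0'});

(Early Puppeteer releases spelled this option 'networkidle'; later releases replaced it with 'networkidle0' and 'networkidle2'.)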
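Put together, this suggestion might look like the sketch below. It assumes a Puppeteer release where the option is spelled 'networkidle0' (older releases used 'networkidle'). fetchRenderedHtml is a name invented here, and its optional second parameter exists only so the function can be exercised with a stand-in for the puppeteer module. Whether network idleness is enough for this particular NCBI page is untested.

```javascript
// fetchRenderedHtml is a hypothetical helper name.
// puppeteerModule defaults to the real package; it is a parameter only so
// that a stand-in object can be passed in for testing.
async function fetchRenderedHtml(url, puppeteerModule = require('puppeteer')) {
  const browser = await puppeteerModule.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle0' resolves once there have been no network connections
    // for at least 500 ms, giving client-side scripts time to run.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Serialized DOM as it stands now, not the HTML originally sent by the server.
    return await page.content();
  } finally {
    // Always shut the browser down, even if navigation throws.
    await browser.close();
  }
}
```

It would be called as fetchRenderedHtml('https://www.ncbi.nlm.nih.gov/gene/?term=AGAP8#see-all') and resolves to the HTML string. Closing the browser in a finally block keeps a failed navigation from leaving a Chrome process behind.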

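Or, if the page dispatches a custom event once it has finished rendering, you can create a function listenFor() that listens for that event on page load:

// Expose a Node-side callback that the page can call as window.onCustomEvent.
await page.exposeFunction('onCustomEvent', e => {
  console.log(`${e.type} fired`, e.detail || '');
});

// Attach the listener in every new document, before the page's own scripts run.
function listenFor(type) {
  return page.evaluateOnNewDocument(type => {
    document.addEventListener(type, e => {
      window.onCustomEvent({type, detail: e.detail});
    });
  }, type);
}

// Call this before page.goto(). 'custom-event-ready' is a placeholder event name.
await listenFor('custom-event-ready');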
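Here is a sketch of how such a listener can be tied to a promise on the Node side, so the script can await the event before reading the HTML. The names watchDocumentEvent and bridgeName are invented for illustration, and the approach only helps if the target page really dispatches an event of that name, which is an assumption here.

```javascript
// watchDocumentEvent and bridgeName are illustrative names, not Puppeteer APIs.
// Call this BEFORE page.goto(): evaluateOnNewDocument only affects documents
// created after it has been registered.
async function watchDocumentEvent(page, type, bridgeName = 'onCustomEvent') {
  let resolveFired;
  const fired = new Promise(resolve => { resolveFired = resolve; });

  // Make a Node-side function callable from the page as window[bridgeName].
  await page.exposeFunction(bridgeName, payload => resolveFired(payload));

  // Install the listener in every new document before its own scripts run.
  await page.evaluateOnNewDocument((type, bridgeName) => {
    document.addEventListener(type, e => {
      window[bridgeName]({ type, detail: e.detail });
    });
  }, type, bridgeName);

  // Wrapped in an object so that awaiting this function does not also
  // wait for the event itself.
  return { fired };
}

// Intended usage, inside an async function with an open page:
//   const watcher = await watchDocumentEvent(page, 'custom-event-ready');
//   await page.goto(url);
//   await watcher.fired;
//   const html = await page.content();
```

There is no timeout in this sketch, so if the event never fires, awaiting watcher.fired never completes; racing it against a timer would be a sensible addition.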

Later edit:

This also might come in handy:

await page.waitForSelector('h3'); // replace h3 with your selector
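As a sketch of how the selector wait could be combined with reading the HTML: contentAfterSelector is a made-up helper name, page is assumed to be a Puppeteer Page, and the selector has to be one that only exists once the client-side rendering has finished (which one that is on the NCBI page is not known here).

```javascript
// contentAfterSelector is a hypothetical helper; `page` is a Puppeteer Page.
async function contentAfterSelector(page, url, selector, timeoutMs = 30000) {
  await page.goto(url);
  try {
    // Resolves once an element matching `selector` exists in the DOM.
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // waitForSelector rejects when the wait fails, for example on timeout.
    console.error(`Waiting for "${selector}" failed: ${err.message}`);
    return null;
  }
  // Serialized DOM after the awaited element has appeared.
  return page.content();
}
```

Returning null on failure lets the caller decide whether to retry or give up, instead of silently saving HTML that may still be the pre-rendered version.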

answered Oct 06 '22 01:10

Carol-Theodor Pelu

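Maybe try waiting for the navigation to finish. Note that page.waitForNavigation() takes an options object rather than a number, and it only resolves if a navigation actually happens after the call:

await page.waitForNavigation({waitUntil: 'networkidle0'});

and after that:

let html = await page.content();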

answered Oct 05 '22 23:10

Evgeniy Grabelsky

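I had success using the following to get HTML content that was generated after the page had loaded.

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(url);
  // Fixed 2-second pause to let client-side scripts finish.
  // (page.waitFor() is deprecated in later Puppeteer versions, so a plain timer is used.)
  await new Promise(resolve => setTimeout(resolve, 2000));
  // Read the inner HTML of the first element matching the selector.
  let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
  console.log(html_content);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}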
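If nothing matches the selector, the extraction step above rejects rather than returning an empty result. A slightly more defensive sketch of the same delay-then-read idea follows; innerHtmlAfterDelay is a hypothetical helper name, and page is assumed to be a Puppeteer Page that has already navigated to the target URL.

```javascript
// innerHtmlAfterDelay is a hypothetical helper; `page` is a Puppeteer Page
// that has already been navigated with page.goto().
async function innerHtmlAfterDelay(page, selector, delayMs = 2000) {
  // Plain timer, independent of any Puppeteer delay helper.
  await new Promise(resolve => setTimeout(resolve, delayMs));
  // page.$() resolves to null when no element matches the selector.
  const handle = await page.$(selector);
  if (handle === null) {
    return null;
  }
  // Runs in the page context and returns the element's markup as a string.
  return page.evaluate(el => el.innerHTML, handle);
}
```

A fixed delay is simple but a guess: too short and the content is not there yet, too long and every run wastes time, so waiting on a selector is usually the more reliable choice when a suitable one exists.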

Hope this helps.

answered Oct 06 '22 01:10

Darren Hall

Tags: javascript, node.js, web-scraping, puppeteer, google-chrome-headless