So I'm trying to scrape all the concerts in the boxed-off area in the picture below:
https://i.stack.imgur.com/7QIMM.jpg
The problem is that the list only shows the first 10 options until you scroll to the bottom of that specific div, at which point it dynamically loads more until there are no more results. I tried following the answer linked below, but I couldn't get it to scroll down far enough to reveal all the 'concerts':
How to scroll inside a div with Puppeteer?
Here's my basic code:
const browser = await puppeteerExtra.launch({
  args: ['--no-sandbox']
});

async function functionName() {
  const page = await browser.newPage();
  await preparePageForTests(page);
  page.once('load', () => console.log('Page loaded!'));
  await page.goto(`https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail`);

  const resultList = await page.waitForSelector(".odIJnf");
  const scrollableSection = await page.waitForSelector("#Q5Vznb"); // I think this is the div that contains all the concert items.
  const results = await page.$$(".odIJnf"); // this needs to be iterable to be used in the for loop

  // this is where I'd like to scroll down the div all the way to the bottom

  for (let i = 0; i < results.length; i++) {
    const result = await (await results[i].getProperty('innerText')).jsonValue();
    console.log(result);
  }
}
Try this to scroll down on the list of concerts. You can keep looping until the number of results stops increasing, or you find the concert you are looking for:
await page.evaluate(()=>{
document.querySelector("#Q5Vznb").scrollIntoView(false);
});
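For example, one way to write that loop is to re-count the .odIJnf results after each scroll and stop when the count no longer grows. This is a rough sketch that reuses the selectors from the question, which may have changed since:

// Keep scrolling the results container until no new concerts load.
// ".odIJnf" and "#Q5Vznb" are the selectors taken from the question above.
let previousCount = 0;
while (true) {
  const count = await page.$$eval(".odIJnf", els => els.length);
  if (count === previousCount) break; // no new results appeared, stop
  previousCount = count;
  await page.evaluate(() => {
    document.querySelector("#Q5Vznb").scrollIntoView(false);
  });
  await new Promise(resolve => setTimeout(resolve, 1000)); // give the list time to load more items
}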
As you mention in your question, when you run page.$$, you get back an array of ElementHandles. From Puppeteer's documentation:

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$ method.

This means you can iterate over them, but you also have to run evaluate() or $eval() over each element to access the DOM element.
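In practice that means either iterating the handles and calling evaluate() on each one, or letting $$eval do the mapping in a single call. The selector here is just the one from the question:

// Option 1: iterate the ElementHandles and evaluate each one in the page.
const handles = await page.$$(".odIJnf");
for (const handle of handles) {
  const text = await handle.evaluate(node => node.innerText);
  console.log(text);
}

// Option 2: let $$eval run the mapping inside the page in one call.
const texts = await page.$$eval(".odIJnf", nodes => nodes.map(node => node.innerText));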
I see from your snippet that you are trying to access the parent div that handles the list's scroll event. The problem is that this page seems to be using auto-generated classes and ids, which can make your code brittle or stop it from working altogether. It is better to target the ul, li, and div elements directly, as in the example and the full snippet below.
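For instance, a structural selector sidesteps the generated class names entirely. The "ul li" shape is an assumption about this particular page; the full snippet below derives the right elements more carefully:

// Grab every list item's text without relying on auto-generated class names.
const texts = await page.$$eval("ul li", nodes => nodes.map(node => node.innerText));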
I've created this snippet that can get ITEMS concerts from the site:
const puppeteer = require('puppeteer')

/**
 * Constants
 */
const ITEMS = process.env.ITEMS || 50
const URL = process.env.URL || "https://www.google.com/search?q=concerts+near+poughkeepsie&client=safari&rls=en&uact=5&ibp=htl;events&rciv=evn&sa=X&fpstate=tldetail"

/**
 * Main
 */
main()
  .then(() => console.log("Done"))
  .catch((err) => console.error(err))

/**
 * Functions
 */
async function main() {
  const browser = await puppeteer.launch({ args: ["--no-sandbox"] })
  const page = await browser.newPage()
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
  await page.goto(URL)
  const results = await getResults(page)
  console.log(results)
  await browser.close()
}
async function getResults(page) {
  await page.waitForSelector("ul")
  const ul = (await page.$$("ul"))[0]
  const div = (await ul.$x("../../.."))[0]
  const results = []
  const recurse = async () => {
    // Recursion exit clause
    if (ITEMS <= results.length) {
      return
    }
    const $lis = await page.$$("li")
    // Slicing from results.length avoids duplicating results that were
    // already collected. It also has the benefit of not having to handle
    // the refresh interval until new concerts are available.
    const lis = $lis.slice(results.length)
    for (let li of lis) {
      const result = await li.evaluate(node => node.innerText)
      results.push(result)
    }
    // Move the scroll of the parent-parent-parent div to the bottom
    await div.evaluate(node => node.scrollTo(0, node.scrollHeight))
    await recurse()
  }
  // Start the recursive function
  await recurse()
  return results
}
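Since ITEMS and URL are read from environment variables with defaults, the script can be run with different settings without editing it, for example ITEMS=20 node scrape.js (the filename is just for illustration).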
By studying the page structure, we see that the ul for the list is nested three divs deep inside the div that handles the scroll. We also know that there are only two uls on the page, and the first is the one we want. That is what we do on these lines:
const ul = (await page.$$("ul"))[0]
const div = (await ul.$x("../../.."))[0]
The $x function evaluates an XPath expression; when it is called on an element handle, as it is here with the ul, the relative path ../../.. resolves against that element, letting us walk up the DOM tree until we reach the div that we need. We then run a recursive function until we have collected the items that we want.
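If $x is unavailable or deprecated in your Puppeteer version, the same three-level walk up from the ul can be sketched with evaluateHandle. This assumes the same nesting described above, and the handle it returns is used the same way as div in the snippet:

// Walk three parent elements up from the <ul> to reach the scrolling div.
// Assumes the same ul -> div -> div -> div nesting described above.
const div = await ul.evaluateHandle(node => node.parentElement.parentElement.parentElement)
await div.evaluate(node => node.scrollTo(0, node.scrollHeight))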