I am trying to use Puppeteer to extract the title of this page: https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106
I have the code below:
(async () => {
  const browser = await puppet.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(req.params[0]); //this is the url
  title = await page.evaluate(() => {
    Array.from(document.querySelectorAll("meta")).filter(function (el) {
      return (
        (el.attributes.name !== null &&
          el.attributes.name !== undefined &&
          el.attributes.name.value.endsWith("title")) ||
        (el.attributes.property !== null &&
          el.attributes.property !== undefined &&
          el.attributes.property.value.endsWith("title"))
      );
    })[0].attributes.content.value ||
      document.querySelector("title").innerText;
  });
})();
I have tested it in the browser console, and also with Puppeteer's { headless: false } option. It works as expected in the browser, but when I actually run it with Node it gives me the following error:
10:54:21 AM web.1 | (node:10288) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'attributes' of undefined
10:54:21 AM web.1 | at __puppeteer_evaluation_script__:14:20
So, when I run the same Array.from ...querySelectorAll("meta")... query in the browser I get the expected string:
"Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom"
I'm starting to think I'm doing something wrong with the async promises, as that is the part that is different. Can anyone point me in the right direction?
EDIT: As suggested, I tested using document.title, which should be there, but it came back as an empty string. See code and log below:
console.log(
"testing the return",
(async () => {
const browser = await puppet.launch({ headless: true });
const page = await browser.newPage();
await page.goto(req.params[0]); //this is the url
try {
title = await page.evaluate(() => {
const title = document.title;
const isTitleThere = title == null ? false : true;
//recently read that this checks for undefined as well as null but not an
//undeclared var
return {
title: title,
titleTitle: title.title,
isTitleThere: isTitleThere,
};
});
} catch (error) {
console.log(error, "There was an error");
}
11:54:11 AM web.1 | testing the return Promise { <pending> }
11:54:13 AM web.1 | { title: '', isTitleThere: true }
Does this have to do with single-page application bs? I thought puppeteer handled that because it loads everything first.
EDIT: I have added the networkidle lines and await 8000 milliseconds, as suggested. Title is still empty. Code below and log:
await page.goto(req.params[0], { waitUntil: "networkidle2" });
await page.waitFor(8000);
console.log("done waiting");
title = await page.$eval("title", (el) => el.innerText);
console.log("title: ", title);
console.log("done retrieving");
12:36:39 PM web.1 | done waiting
12:36:39 PM web.1 | title:
12:36:39 PM web.1 | done retrieving
EDIT: PROGRESS!! Thank you to theDavidBarton. It seems headless has to be false for it to work? Does anyone know why?
If you only need the innerText of the title element, Puppeteer's page.$eval method gives the same result more simply:
const title = await page.$eval('title', el => el.innerText)
console.log(title)
Output:
Zella High Waist Studio Pocket 7/8 Leggings | Nordstrom
page.$eval(selector, pageFunction[, ...args])
The page.$eval method runs document.querySelector(selector) within the page and passes the matched element as the first argument to pageFunction.
However, your main problem is that the page you are visiting is a single-page app (SPA) built with React, and its title is filled in dynamically by the JavaScript bundle. So Puppeteer does find a valid title element in the <head>, but at that moment its content is simply "" (an empty string).
Normally, with SPAs, you should use waitUntil: 'networkidle0' to make sure the JS framework has populated the DOM and the page is fully functional:
await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
waitUntil: 'networkidle0'
})
Unfortunately, with this specific website that throws a timeout error: the network connections are still open when the default 30000 ms timeout expires. Something seems to be off on the page's frontend side (web worker handling, perhaps).
As a workaround, you can make Puppeteer sleep for 8 seconds with await page.waitFor(8000) before you try to retrieve the title; by then it will be properly populated. This is also why your script works in the DevTools console: you are not running it immediately, so by the time you do, the page is fully loaded and the DOM is populated.
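If you still want to try networkidle0 without letting the timeout abort the whole script, one option is to catch the timeout and carry on with whatever has loaded. The following is only a sketch: gotoAllowingIdleTimeout is a hypothetical helper name, page is assumed to be a Puppeteer Page, and the check on err.name assumes Puppeteer's timeout errors are named 'TimeoutError'.

```javascript
// Sketch: navigate with networkidle0, but treat an idle timeout as non-fatal.
// Returns true if the network went idle in time, false if the wait timed out.
async function gotoAllowingIdleTimeout(page, url, timeoutMs = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle0', timeout: timeoutMs });
    return true;
  } catch (err) {
    // Assumption: timeout errors carry the name 'TimeoutError'.
    if (err && err.name === 'TimeoutError') return false;
    // Anything else (bad URL, DNS failure, ...) is a real error: rethrow it.
    throw err;
  }
}
```

A false return value only means the network never went idle; the page may still have rendered, so the title can be checked afterwards.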
This script will return the expected title:
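const puppeteer = require('puppeteer')

async function fn() {
  // headless: false opens a visible browser window
  const browser = await puppeteer.launch({ headless: false })
  const page = await browser.newPage()
  await page.goto('https://www.nordstrom.com/s/zella-high-waist-studio-pocket-7-8-leggings/5460106', {
    waitUntil: 'networkidle2'
  })
  // give the SPA time to fill in the <title> element
  // (page.waitFor is deprecated in newer Puppeteer releases)
  await page.waitFor(8000)
  const title = await page.$eval('title', el => el.innerText)
  console.log(title)
  await browser.close()
}
fn()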
Launching with puppeteer.launch({ headless: false }) may affect the result as well.
When navigating to the page, wait until the page is loaded:
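As an alternative to the fixed 8-second sleep, you could wait for the condition you actually care about: the title becoming non-empty. This is a minimal sketch, not a guaranteed fix for this site. waitForNonEmptyTitle is a hypothetical helper name, page is assumed to be a Puppeteer Page, and it relies on page.waitForFunction and page.title.

```javascript
// Sketch: wait for the SPA to fill in <title> instead of sleeping a fixed time.
async function waitForNonEmptyTitle(page, timeoutMs = 30000) {
  // Polls inside the page until document.title is a non-blank string;
  // rejects with a timeout error if that never happens within timeoutMs.
  await page.waitForFunction(() => document.title.trim().length > 0, {
    timeout: timeoutMs,
  });
  // page.title() resolves to the current document title.
  return page.title();
}

// Usage (assumes puppeteer is installed and `url` holds the product page URL):
//   const browser = await puppeteer.launch({ headless: false });
//   const page = await browser.newPage();
//   await page.goto(url, { waitUntil: 'domcontentloaded' });
//   console.log(await waitForNonEmptyTitle(page));
//   await browser.close();
```

If the title never gets populated (for example in headless mode on this site), the call rejects after timeoutMs instead of silently returning an empty string.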
await page.goto(req.params[0], { waitUntil: "networkidle2" }); //this is the url
Could you try this:
try {
title = await page.evaluate(() => {
const title = document.title;
const isTitleThere = title == null? false: true
//recently read that this checks for undefined as well as null but not an
//undeclared var
return {"title":title,"isTitleThere" :isTitleThere }
})
} catch (error) {
console.log(error, 'There was an error');
}
Or this:
try {
  title = await page.evaluate(() => {
    // Read the content attribute: a DOM element itself cannot be
    // returned from page.evaluate, only serializable values can.
    const meta = document.querySelector('meta[property="og:title"]');
    const title = meta ? meta.content : null;
    // == null is true for both null and undefined
    const isTitleThere = title == null ? false : true
    return { "title": title, "isTitleThere": isTitleThere }
  })
} catch (error) {
  console.log(error, 'There was an error');
}
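The TypeError in the question comes from indexing [0] on the filtered array: when no meta tag matches, that entry is undefined, so reading .attributes on it throws before the fallback to the title element is ever reached. A more defensive version could look like the sketch below. pickTitle is a hypothetical helper name, and the doc parameter exists only so the function can be exercised outside a browser; it defaults to the page's own document.

```javascript
// Sketch: pick a title from meta tags, falling back to document.title,
// without assuming that any meta tag matches.
function pickTitle(doc = document) {
  const metas = Array.from(doc.querySelectorAll('meta'));
  const match = metas.find((el) => {
    const name = el.getAttribute('name') || '';
    const property = el.getAttribute('property') || '';
    const hasTitleKey = name.endsWith('title') || property.endsWith('title');
    // Only accept tags that actually carry a non-empty content attribute.
    return hasTitleKey && Boolean(el.getAttribute('content'));
  });
  if (match) return match.getAttribute('content');
  // No usable meta tag: fall back to document.title ('' if not populated yet).
  return doc.title || '';
}

// In Puppeteer the function runs in the page context, where `document` exists:
//   const title = await page.evaluate(pickTitle);
```

An empty string here still means the page has not been populated yet, so this complements waiting for the page to load rather than replacing it.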