I'm building a scraper, but I'm stuck iterating through the ElementHandles.
I need to get the list of row elements, which I do successfully. After that, for each row I need to capture each td's text/innerHTML (I'm unsure which is which). For now it would be enough to print them to stdout.
The error I'm getting is UnhandledPromiseRejectionWarning: TypeError: tds.forEach is not a function, which from my googling tells me that tds is not an array.
I am able to achieve this in Python with Selenium, but since I'm a JavaScript newbie, I suspect I'm doing something very wrong.
From my understanding, element.$$('td') returns a Promise, but if I put await in front of it I get SyntaxError: await is only valid in async function.
const selectors = await page.$$('#transactionItems > tbody > tr');
console.log(selectors.length); // outputs 31 which is the right number
selectors.forEach( (element) => {
  let tds = element.$$('td');
  console.log(tds);
  tds.forEach( (element) => {
    console.log(element.innerText)
  });
});
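For what it's worth, the loop above seems to fail for two reasons: element.$$('td') returns a Promise rather than an array, and await cannot be used inside a forEach callback that is not itself async. Also, an ElementHandle is a Node-side reference and not the DOM element, so element.innerText is undefined on it. A minimal sketch of one way around this, assuming page is a Puppeteer Page; the function name readRows and the rowSelector parameter are made up for illustration:

```javascript
// Sketch only: collect the text of every <td>, one array per row.
// `readRows` and `rowSelector` are hypothetical names, not part of Puppeteer.
async function readRows(page, rowSelector) {
  // page.$$ resolves to an array of ElementHandles, so it must be awaited
  const rows = await page.$$(rowSelector);
  const result = [];
  // for...of stays inside this async function, so `await` is legal here,
  // unlike inside a plain (non-async) forEach callback
  for (const row of rows) {
    // $$eval runs the callback in the browser against all matching
    // descendants of the row and returns the serialized result
    const cells = await row.$$eval('td', tds =>
      tds.map(td => td.textContent.trim())
    );
    result.push(cells);
  }
  return result;
}
```

Inside an async function it could be called as const rows = await readRows(page, '#transactionItems > tbody > tr'); each entry of rows would then be an array of cell strings for one tr.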
EDIT:
I have tried the following code, which prints successfully, but it's still not what I want.
const selectors = await page.$$('#transactionItems > tbody > tr');
console.log(selectors.length);
for (let tr of selectors) {
  const trText = await page.evaluate(el => el.innerHTML, tr);
  console.log(trText);
}
It outputs the following:
<td> T737410C - <a class="pointer" target="_blank" onclick="openAPRImageWindow("T071835642571","112255603963");">Image</a></td>
<td>02/05/2018 06:48:06</td>
<td>DRPA</td>
<td> 07W - CBB</td>
<td>OPEN</td>
<td>$5.00</td>
<td>$25.00</td>
<td>$0.00</td>
<td>$30.00</td>
Ideally, the output would be:
['T737410C', '02/05/2018 06:48:06', 'OPEN', '5.00', '25.00']
Try this script:
const puppeteer = require('puppeteer');
const html = `
<html>
<body>
<table>
<tr><td> T737410C - <a href=".">Image</a></td>
<td>02/05/2018 06:48:06</td><td>DRPA</td>
<td> 07W - CBB</td><td>OPEN</td></tr>
</table>
</body>
</html>`;
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`data:text/html,${html}`);
  const data = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr td'));
    return tds.map(td => {
      const txt = td.innerHTML;
      // Strip the <a>...</a> link, then trim surrounding whitespace
      return txt.replace(/<a [^>]+>[^<]*<\/a>/g, '').trim();
    });
  });
  // data is now an array of strings
  console.log(data);
  await browser.close();
})();
However, it is worth mentioning that you may need some extra replaces to remove the trailing dashes, etc.
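As a rough sketch of what those extra replaces might look like: the helper below strips the link, a trailing dash and a leading dollar sign from one cell's innerHTML. The name cleanCell is made up for illustration, and the sample cells are a simplified subset of the question's output (the onclick attribute is dropped, and only the columns from the desired result are included).

```javascript
// Hypothetical helper, not part of Puppeteer: normalise one cell's innerHTML.
function cleanCell(html) {
  return html
    .replace(/<a\b[^>]*>[^<]*<\/a>/g, '') // drop the "Image" link
    .trim()                               // drop surrounding whitespace
    .replace(/\s*-$/, '')                 // drop a trailing dash, e.g. "T737410C -"
    .replace(/^\$/, '');                  // drop a leading dollar sign
}

// Simplified sample cells, assumed for illustration
const cells = [
  ' T737410C - <a class="pointer" target="_blank">Image</a>',
  '02/05/2018 06:48:06',
  'OPEN',
  '$5.00',
  '$25.00',
];

const cleaned = cells.map(cleanCell);
console.log(cleaned);
```

Because the dash pattern is anchored to the end of the string, a value such as 07W - CBB keeps its inner dash. Choosing which columns to keep is a separate step, for example by picking indexes from each row's array.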