
Puppeteer iterating through elementHandles from page.$$ selector

I'm building a scraper, but I'm stuck on iterating through the elementHandles.

I need to get the list of row elements, which I do successfully. After that, for each row I need to capture each td's text or innerHTML (I'm unsure which is which). For now it would be enough just to print them to stdout.

The error I'm getting is UnhandledPromiseRejectionWarning: TypeError: tds.forEach is not a function, which, from what I've found by searching, means that tds is not an array.

I am able to achieve this in Python with Selenium, but since I'm a JavaScript newbie, I suspect I'm doing something very wrong.

From my understanding, element.$$('td') returns a Promise, but if I put await in front of it I get SyntaxError: await is only valid in async function.

  const selectors = await page.$$('#transactionItems > tbody > tr');
  console.log(selectors.length); // outputs 31 which is the right number
  selectors.forEach( (element) => {
    let tds = element.$$('td');
    console.log(tds);
    tds.forEach( (element) => { 
      console.log(element.innerText)
    });
  });

EDIT:

I have tried the following code, which prints each row successfully, but that's still not what I want.

const selectors = await page.$$('#transactionItems > tbody > tr ');
console.log(selectors.length);
for(let tr of selectors){
  const trText = await page.evaluate(el => el.innerHTML, tr);
  console.log(trText)
}

It outputs the following:

<td> T737410C - <a class="pointer" target="_blank" onclick="openAPRImageWindow(&quot;T071835642571&quot;,&quot;112255603963&quot;);">Image</a></td>
<td>02/05/2018 06:48:06</td>
<td>DRPA</td>
<td> 07W - CBB</td>
<td>OPEN</td>
<td>$5.00</td>
<td>$25.00</td>
<td>$0.00</td>
<td>$30.00</td>

Ideally, the output would be ['T737410C', '02/05/2018 06:48:06', 'OPEN', '5.00', '25.00'].

Borko Kovacev asked Apr 11 '18 20:04



1 Answer

Try this script:

const puppeteer = require('puppeteer');

const html = `
<html>
    <body>
    <table>
    <tr><td> T737410C - <a href=".">Image</a></td>
        <td>02/05/2018 06:48:06</td><td>DRPA</td>
        <td> 07W - CBB</td><td>OPEN</td></tr>
    </table>
    </body>
</html>`;

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`data:text/html,${html}`);

  // Runs in the page context: collect every cell and strip the <a> link markup.
  const data = await page.evaluate(() => {
      const tds = Array.from(document.querySelectorAll('table tr td'));
      return tds.map(td => {
         const txt = td.innerHTML;
         return txt.replace(/<a [^>]+>[^<]*<\/a>/g, '').trim();
      });
  });

  //You will now have an array of strings
  console.log(data);
  await browser.close();
})()

However, it is worth mentioning that you may need some extra replace calls to remove the trailing dashes, etc.
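As a rough sketch of that extra cleanup, something like the following could be applied to each string in the array. The helper name cleanCell is made up for illustration, and the two rules (drop a trailing dash, drop a leading dollar sign) are assumptions based on the sample rows in the question, so adjust them to your real data.

```javascript
// Hypothetical helper: the name and the exact rules are assumptions,
// based only on the sample cells shown in the question.
function cleanCell(text) {
  return text
    .trim()                    // remove surrounding whitespace
    .replace(/\s*-\s*$/, '')   // drop a dash left at the end of the cell
    .replace(/^\$/, '');       // drop a leading dollar sign
}

// Usage with the array returned by page.evaluate in the script above:
//   const cleaned = data.map(cleanCell);
```

A dash in the middle of a cell, as in "07W - CBB", is left alone because the pattern only matches at the end of the string.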

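If you would rather keep iterating over the element handles, as in the question, here is a minimal sketch of one way to do it. The await has to sit directly inside an async function, which a plain forEach callback is not, so a for...of loop is used instead. extractRows is a hypothetical helper name, and page is assumed to be an already opened Puppeteer page.

```javascript
// Sketch: read each row's cell text by iterating ElementHandles.
// `extractRows` is a hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function extractRows(page, rowSelector) {
  // page.$$ resolves to an array of ElementHandles, one per matching row.
  const rows = await page.$$(rowSelector);
  const result = [];
  // for...of stays inside this async function, so `await` is allowed here.
  for (const row of rows) {
    // $$eval runs the callback in the browser against the row's <td> elements.
    const cells = await row.$$eval('td', tds =>
      tds.map(td => td.innerText.trim())
    );
    result.push(cells);
  }
  return result;
}

// Usage, inside an async function that already has a `page`:
//   const rows = await extractRows(page, '#transactionItems > tbody > tr');
//   console.log(rows);
```

This returns one array of strings per row. Note that innerText includes the link text, so the first cell would still contain "Image" and needs the same kind of cleanup as above.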

Rippo answered Oct 21 '22 01:10
