Extracting text based on a regex pattern with cheerio nodejs

Question

I am trying to build a scraper with Node.js that extracts news headlines from a large number of domains. The sites are all different, so my approach has to be as general as possible. At the moment I have a working implementation in Python which uses Beautiful Soup and a regex: I define a set of keywords and it returns the headlines containing those keywords. Below is the relevant snippet of Python code:

for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))

To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):

<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>

The expected output given the keyword Time would be a string with the headline: Time to end Clap for Carers, says founder

My question is: is it possible to do a similar thing with cheerio? What would be the best approach to achieve the same results in nodejs?

EDIT: This works for me now. On top of matching headlines, I also wanted to extract post URLs:

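function match_headlines($) {

      const keywords = ['lockdown', 'quarantine'];

      // Backslashes are doubled because the pattern is built from string literals
      // ('\b' on its own would be a backspace character).
      // No "g" flag: with it, match() returns a plain array without 'input'.
      const regexPattern = new RegExp('\\b[A-Z].*?' + "(" + keywords.join('|') + ")" +
                 '.*\\b');

      let matches = $('a').map((i, a) => {

          let links = $(a).attr('href');
          let match = $(a).text().match(regexPattern);

          if (match !== null) {

             let posts = {

                 headline: match['input'],
                 post_url: links
             }

             return posts

          }

     })

     // cheerio's map() drops null/undefined results; get() returns a plain array
     return matches.get()

}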
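
The href in the sample markup above is relative ("/news/uk-52773032"), so the extracted post_url may need to be resolved against the address of the page it came from. A minimal sketch using Node's built-in URL class. The function name toAbsoluteUrl and the base address https://example.com/news are assumptions for illustration only:

```javascript
// Hypothetical helper: resolve an href against the URL of the scraped page.
// Returns null when the href is missing or cannot be parsed.
function toAbsoluteUrl(href, pageUrl) {
  if (!href) return null;
  try {
    return new URL(href, pageUrl).href;
  } catch (err) {
    return null;
  }
}

// https://example.com/news is a placeholder base address, not a real news site.
console.log(toAbsoluteUrl('/news/uk-52773032', 'https://example.com/news'));
```

An href that is already absolute is returned unchanged apart from normalisation, so the helper can be applied to every link without checking first.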

pguardiario · Accepted Answer

Maybe something like this:

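// '\\b' is a word boundary; the group keeps the boundaries around every keyword
let re = new RegExp('\\b(?:' + keywords.join('|') + ')\\b')
// get() turns the cheerio collection into a plain array of strings
let texts = $('a h3').map((i, a) => $(a).text()).get()
let titles = texts.filter(text => text.match(re))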
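
A slightly fuller sketch of the same idea, assuming the headline text and href have already been pulled out of cheerio into plain objects (for example with .map(...).get()). The helpers escapeRegExp, buildKeywordRegex and filterHeadlines are hypothetical names, not part of cheerio. Keywords are escaped so that characters such as "." or "+" match literally, and the "i" flag is an assumption that case-insensitive matching is wanted:

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build /\b(?:kw1|kw2)\b/i from a keyword list.
function buildKeywordRegex(keywords) {
  const alternatives = keywords.map(escapeRegExp).join('|');
  return new RegExp('\\b(?:' + alternatives + ')\\b', 'i');
}

// Keep only the items whose headline contains a keyword as a whole word.
function filterHeadlines(items, keywords) {
  const re = buildKeywordRegex(keywords);
  return items.filter(item => re.test(item.headline.trim()));
}

// The first entry comes from the question; the second is made-up sample data.
const items = [
  { headline: 'Time to end Clap for Carers, says founder', post_url: '/news/uk-52773032' },
  { headline: 'Sometimes the weather is nice', post_url: '/news/weather' },
];

console.log(filterHeadlines(items, ['Time']));
```

Only the first item should be kept here: the word boundaries stop "Time" from matching inside "Sometimes".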

Tags: node.js, cheerio

Matthew
