I am trying to build a scraper with Node.js that extracts news headlines from a large number of domains (they are all different, so my approach has to be as general as possible). At the moment I have a working implementation in Python that uses Beautiful Soup and a regex, which lets me define a set of keywords and return the headlines containing those keywords. Below is the relevant snippet of Python code:
for items in soup(text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords))):
To illustrate the expected output, let's assume there is a domain with news articles (below is an HTML snippet containing a headline):
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/uk-52773032"><h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Time to end Clap for Carers, says founder</h3></a>
The expected output for the keyword Time would be a string containing the headline: Time to end Clap for Carers, says founder
My question is: is it possible to do something similar with cheerio? What would be the best approach to achieve the same result in Node.js?
EDIT: This works for me now. On top of matching headlines, I also wanted to extract the post URLs:
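function match_headlines($) {
  const keywords = ['lockdown', 'quarantine'];
  // No "g" flag: with "g", match() returns only the matched substrings
  // and the result has no `input` property.
  const regexPattern = new RegExp('\\b[A-Z].*?(' + keywords.join('|') + ').*\\b');
  const matches = $('a').map((i, a) => {
    const link = $(a).attr('href');
    const text = $(a).text();
    if (regexPattern.test(text)) {
      // Keep the full link text as the headline, together with its URL
      return { headline: text, post_url: link };
    }
    // Returning nothing makes cheerio's .map() skip this element
  });
  // .get() converts the cheerio collection into a plain array
  return matches.get();
}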
Maybe something like this:
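// Group the alternatives so that \b applies to every keyword, not just the first and last
let re = new RegExp('\\b(?:' + keywords.join('|') + ')\\b')
// .get() turns the cheerio collection into a plain array of strings
let texts = $('a h3').map((i, el) => $(el).text()).get()
let titles = texts.filter(text => re.test(text))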
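If the keywords can contain characters that are special in a regex (for example "C++" or "a.b"), it is probably worth escaping them before joining. Below is a minimal sketch of the keyword-matching step on its own, without cheerio, so it can be run directly in Node.js. The helper names (escapeRegExp, buildKeywordRegex, filterHeadlines) and the sample headlines are made up for illustration; in a real scraper the strings would come from something like the `$('a h3')` selection above.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one case-sensitive pattern of the form \b(?:kw1|kw2)\b,
// mirroring the Python pattern from the question.
function buildKeywordRegex(keywords) {
  const alternatives = keywords.map(escapeRegExp).join('|');
  return new RegExp('\\b(?:' + alternatives + ')\\b');
}

// Keep only the headlines that contain at least one keyword as a whole word.
function filterHeadlines(headlines, keywords) {
  const re = buildKeywordRegex(keywords);
  return headlines.filter((headline) => re.test(headline));
}

// Hypothetical sample data standing in for text extracted with cheerio.
const sampleHeadlines = [
  'Time to end Clap for Carers, says founder',
  'Timetable changes announced',
];

const matched = filterHeadlines(sampleHeadlines, ['Time']);
console.log(matched);
```

The second sample headline is skipped because of the word boundaries: "Time" is only a prefix of "Timetable" there. The match is case-sensitive, like the Python version; passing 'i' as the second argument to RegExp would relax that.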