I'm using Cheerio JS to simplify some ancient HTML code and transform it into HTML5. Among other things, I'm replacing some markup-heavy quotes that look like the following:
Node to be replaced:
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tbody>
<tr>
<td class="alt2" style="border:1px solid #999">
<div>
Originally Posted by <strong>Username</strong>
</div>
<div style="font-style:italic">Lorem ipsum dolor sit amet</div>
</td>
</tr>
</tbody>
</table>
</div>
The transformed output is supposed to look like this:
<blockquote>Lorem ipsum dolor sit amet</blockquote>
Here's the code I'm currently using:
$(`table[id^='post']`).each( (i, el) => {
  // Get the post
  let postBody = $(el).find(`div[id^='post_message_']`).html().trim();
  // Replace quotes with blockquotes
  cheerio.load(postBody)('div[style^="margin:20px; margin-top:5px; "]').each( (i, el) => {
    if ($(el).html().trim().startsWith('<div class="smallfont" style="margin-bottom:2px">Quote')) {
      let tbody = $(el).find('tbody > tr > td').html();
      let quote = $(el).find('tbody > tr > td > div');
      if (quote.html() && quote.text().trim().startsWith('Originally Posted by')) {
        let replacement = $('<blockquote>Hello</blockquote>');
        quote.parent().html().replace(quote.html(), replacement);
      }
      // Looks all good
      console.log($(el).html())
    }
    postBody = $(el).html();
  });
});
And lastly, more HTML for some context:
<div id="post_message_123456">
As Username has previously written
<br>
<div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tbody>
<tr>
<td class="alt2" style="border:1px solid #999">
<div>
Originally Posted by <strong>Username</strong>
</div>
<div style="font-style:italic">Lorem ipsum dolor sit amet</div>
</td>
</tr>
</tbody>
</table>
</div>
<br>
I think he has a point!
<img src="smile-with-sunglasses.gif" />
</div>
The replacement itself seems to work; the output of the console.log() statement looks all good. The problem lies in the last line, where I'm trying to replace the original content with the replacement. However, postBody looks like it did before. What am I doing wrong?
Try it like this:
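const cheerio = require('cheerio')
// `html` is the markup of the post as a string
let $ = cheerio.load(html)
// Swap the "Originally Posted by" <div> for a <blockquote>, modifying the parsed document in place
$('.alt2 div:contains("Originally Posted by")').replaceWith('<blockquote>Lorem ipsum dolor sit amet</blockquote>')
console.log($.html())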
This example swaps insecure URLs for secure ones, which is a useful real-world case. It also shows how to make programmatic decisions along the way, something most people find much easier to do with Cheerio than with regular expressions.
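const cheerio = require('cheerio')
const $ = cheerio.load(html)
// Example: rewrite every http:// image URL to https://
$('img[src^="http://"]').each(function() {
  // `let`, not `const`, because src may be reassigned below
  let src = $(this).attr('src')
  // includes() instead of indexOf(): indexOf() returns -1 (truthy) when there is no match
  if (src.includes('s3.amazon.com')) {
    src = src.replace('s3.amazon.com', 'storage.azure')
  }
  // Update the attribute in place; the <img> element itself does not need replacing
  $(this).attr('src', src.replace('http://', 'https://'))
})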