
Replace HTML nodes with Cheerio

I'm using Cheerio JS to simplify some ancient HTML code and transform it into HTML5. Among other things, I'm replacing some markup-heavy quotes that look like the following:

Node to be replaced:

<div style="margin:20px; margin-top:5px; ">
    <div class="smallfont" style="margin-bottom:2px">Quote:</div>
    <table cellpadding="6" cellspacing="0" border="0" width="100%">
        <tbody>
            <tr>
                <td class="alt2" style="border:1px solid #999">
                    <div>
                        Originally Posted by <strong>Username</strong>
                    </div>
                    <div style="font-style:italic">Lorem ipsum dolor sit amet</div>
                </td>
            </tr>
        </tbody>
    </table>
</div>

The transformed output is supposed to look like this:

<blockquote>Lorem ipsum dolor sit amet</blockquote>

Here's the code I'm currently using:

$(`table[id^='post']`).each( (i, el) => {
    // Get the post
    let postBody = $(el).find(`div[id^='post_message_']`).html().trim();

    // Replace quotes with blockquotes
    cheerio.load(postBody)('div[style^="margin:20px; margin-top:5px; "]').each( (i, el) => {
        if ($(el).html().trim().startsWith('<div class="smallfont" style="margin-bottom:2px">Quote')) {
            let tbody = $(el).find('tbody > tr > td').html();
            let quote = $(el).find('tbody > tr > td > div');

            if (quote.html() && quote.text().trim().startsWith('Originally Posted by')) {
                let replacement = $('<blockquote>Hello</blockquote>');
                quote.parent().html().replace(quote.html(), replacement);
            }

            // Looks all good
            console.log($(el).html())
        }

        postBody = $(el).html();
    });
});

And lastly, more HTML for some context:

<div id="post_message_123456">
    As Username has previously written
    <br>
    <div style="margin:20px; margin-top:5px; ">
        <div class="smallfont" style="margin-bottom:2px">Quote:</div>
        <table cellpadding="6" cellspacing="0" border="0" width="100%">
            <tbody>
                <tr>
                    <td class="alt2" style="border:1px solid #999">

                        <div>
                            Originally Posted by <strong>Username</strong>
                        </div>
                        <div style="font-style:italic">Lorem ipsum dolor sit amet</div>
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
    <br>
    I think he has a point!
    <img src="smile-with-sunglasses.gif" />
</div>

The replacement itself seems to work: the output of the console.log() statement looks fine. The problem lies in the last line, where I try to overwrite the original content with the replacement, yet postBody looks the same as it did before. What am I doing wrong?

Asked by idleberg, Oct 06 '18 13:10



2 Answers

Try it like this:

let $ = cheerio.load(html)

$('.alt2 div:contains("Originally Posted by")').replaceWith('<blockquote>Lorem ipsum dolor sit amet</blockquote>')

console.log($.html())
Answered by pguardiario, Oct 20 '22 10:10



Replace items based on individual context

This demonstrates how you could swap insecure URLs for secure ones, as a useful real-world example. It also shows how to make programmatic decisions per element, which for most people is much easier than doing the same with regex.

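const $ = cheerio.load(html)
// Example: rewrite every http:// image URL to https://
$('img[src^="http://"]').each(function () {
  // `let`, because src is reassigned below
  let src = $(this).attr('src')
  // indexOf() returns -1 (truthy) when there is no match, so use includes()
  if (src.includes('s3.amazon.com')) {
    src = src.replace('s3.amazon.com', 'storage.azure')
  }
  // Update the attribute in place; the element itself does not need replacing
  $(this).attr('src', src.replace('http://', 'https://'))
})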
Answered by King Friday, Oct 20 '22 11:10
