
How to remove <div> and <br> using Cheerio js?

I have the following HTML that I would like to parse with Cheerio.

    var $ = cheerio.load('<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>This works well.</div><div><br clear="none"/></div><div>So I have been doing this for several hours. How come the space does not split? Thinking that this could be an issue.</div><div>Testing next paragraph.</div><div><br clear="none"/></div><div>Im testing with another post. This post should work.</div><div><br clear="none"/></div><h1>This is for test server.</h1></body></html>', {
    normalizeWhitespace: true,
});

// trying to parse the html
// the goals are to 
// 1. remove all the 'div'
// 2. clean up <br clear="none"/> into <br>
// 3. Have all the new 'empty' element added with 'p'

var testData = $('div').map(function(i, elem) {
    var test = $(elem)
    if ($(elem).has('br')) {
        console.log('spaceme');
        var test2 = $(elem).removeAttr('br');
    } else {
        var test2 = $(elem).removeAttr('div').add('p');
    }
    console.log(i +' '+ test2.html());
    return test2.html()
})

res.send(test2.html())

My end goals are to try and parse the html

  • remove all the div
  • clean up <br clear="none"/> and change into <br>
  • and finally wrap each sentence that was inside a 'div' in <p> ... </p> instead

I tried to start with a smaller goal in the code above. Removing all the 'div' worked, but I'm unable to find the 'br'. I have been trying for days and have made no headway.

So I'm writing here to ask for some help and hints on how I can get to my end goal.

Thank you :D

asked Mar 01 '15 05:03 by bosslee


1 Answer

It's easier than it looks. First you iterate over all the DIVs

$('div').each(function() { ...

and for each div, you check whether it contains a <br> tag

$(this).find('br').length

if the div does contain a <br>, you remove the clear attribute from it

$(this).find('br').removeAttr('clear');

if not, you create a P with the same content

var p = $('<p>' + $(this).html() + '</p>');

and then just replace the DIV with the P

$(this).replaceWith(p);

and output

res.send($.html());

All together it's

$('div').each(function() {
    if ( $(this).find('br').length ) {
        $(this).find('br').removeAttr('clear');
    } else {
        var p = $('<p>' + $(this).html() + '</p>');
        $(this).replaceWith(p);
    }
});

res.send($.html());

answered Oct 25 '22 20:10 by adeneo