Cheerio: Extract Text from HTML with separators

Let's say I have the following:

var cheerio = require('cheerio');
var $ = cheerio.load('<html><body><ul><li>One</li><li>Two</li></ul></body></html>');

var t = $('html').find('*').contents().filter(function() {
  return this.type === 'text';
}).text(); 

I get:

OneTwo

Instead of:

One Two

This is the same result I get from $('html').text(). What I need is a way to inject a separator, such as a space or \n, between the text nodes.

Note: this is not a jQuery front-end question; it is a Node.js back-end issue with Cheerio and HTML parsing.

Crisboot asked Jul 21 '15 15:07


2 Answers

This seems to do the trick:

var t = $('html *').contents().map(function() {
    return (this.type === 'text') ? $(this).text() : '';
}).get().join(' ');

console.log(t);

Result:

One Two

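I improved my solution a little. Appending the separator to each text node and joining with an empty string means the empty strings returned for non-text nodes no longer add extra separators. The result ends with a trailing space, which .trim() removes: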

var t = $('html *').contents().map(function() {
    return (this.type === 'text') ? $(this).text()+' ' : '';
}).get().join('');
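
The same idea can also be written as a plain recursive walk over the parsed nodes. The sketch below is only an illustration: collectText is a hypothetical helper, and the tree is a hand-built stand-in (an assumption, so the snippet runs without Cheerio) that mirrors the type, data, and children properties Cheerio's nodes expose. It skips whitespace-only text nodes and leaves no trailing separator.

```javascript
// Hypothetical helper (not part of Cheerio): collect the trimmed text of
// every text node under `node`, in document order, skipping empty ones.
function collectText(node, out) {
  out = out || [];
  if (node.type === 'text') {
    var text = node.data.trim();
    if (text) out.push(text);
  } else if (node.children) {
    node.children.forEach(function (child) {
      collectText(child, out);
    });
  }
  return out;
}

// Hand-built stand-in for the parsed <ul><li>One</li><li>Two</li></ul>.
// Assumption: real Cheerio nodes carry the same type/data/children fields.
var tree = {
  type: 'tag', name: 'ul', children: [
    { type: 'tag', name: 'li', children: [{ type: 'text', data: 'One' }] },
    { type: 'tag', name: 'li', children: [{ type: 'text', data: 'Two' }] }
  ]
};

var joined = collectText(tree).join(' ');
console.log(joined); // One Two
```

With Cheerio loaded, passing a raw node such as $('html')[0] instead of the stand-in tree should work the same way, and changing the argument of join to '\n' gives one text node per line.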

Crisboot answered Oct 21 '22 14:10


You can use the TextVersionJS package to generate a plain-text version of an HTML string. It works both in the browser and in Node.js.

var createTextVersion = require("textversionjs");

var yourHtml = "<h1>Your HTML</h1><ul><li>goes</li><li>here.</li></ul>";

var textVersion = createTextVersion(yourHtml);

Install it from npm. For browser use, you can bundle it with a tool such as Browserify.

Balint answered Oct 21 '22 15:10