Getting all the text content from an HTML string in Node.js

I need to extract only the text content from an HTML string, with a space or a line break separating the text of different elements.

For example, the HTML string might be:

<ul>
  <li>First</li>
  <li>Second</li>
</ul>

What I want:

First Second

or

First
Second

I tried wrapping the entire string in a div and reading its textContent with third-party libraries. However, that puts no space or line break between the text of different elements, which I specifically require (I get FirstSecond, which is not what I want).

The only solution I can think of right now is to build a DOM tree and recursively visit the nodes that contain text, appending each node's text to a string with a space in between. Is there a cleaner, neater, and simpler solution than this?
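
For reference, the recursive approach described above might look roughly like the sketch below. It is only an illustration: the node shape (`type`, `data`, `children`) and the `collectText` helper are hypothetical, loosely modeled on what HTML parsers tend to return, and the tree is built by hand rather than parsed from the string.

```javascript
// Hypothetical node shape (an assumption, not a specific library's API):
//   text node:    { type: 'text', data: 'First' }
//   element node: { type: 'tag', name: 'li', children: [...] }
function collectText(node, parts = []) {
  if (node.type === 'text') {
    const trimmed = node.data.trim();
    if (trimmed) parts.push(trimmed); // skip whitespace-only text nodes
  } else if (Array.isArray(node.children)) {
    // Recurse into child nodes in document order
    for (const child of node.children) collectText(child, parts);
  }
  return parts;
}

// Hand-built tree standing in for the parsed <ul> example
const tree = {
  type: 'tag', name: 'ul', children: [
    { type: 'text', data: '\n  ' },
    { type: 'tag', name: 'li', children: [{ type: 'text', data: 'First' }] },
    { type: 'text', data: '\n  ' },
    { type: 'tag', name: 'li', children: [{ type: 'text', data: 'Second' }] },
    { type: 'text', data: '\n' },
  ],
};

const spaced = collectText(tree).join(' ');  // space-separated variant
const lined = collectText(tree).join('\n');  // one text chunk per line
console.log(spaced);
console.log(lined);
```

Collecting the pieces into an array and joining at the end lets the same traversal produce either the space-separated or the line-separated form.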

thesamiroli asked Jun 20 '26 04:06


2 Answers

Convert HTML to Plain Text:

In your terminal, install the html-to-text npm package:

npm install html-to-text

Then in JavaScript:

const { convert } = require('html-to-text'); // Import the library

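const htmlString = `
<ul>
  <li>First</li>
  <li>Second</li>
</ul>
`;

// wordwrap: 130 wraps output lines at 130 characters
const text = convert(htmlString, { wordwrap: 130 });
console.log(text);
// Out (with default options, each list item gets a " * " prefix):
//  * First
//  * Second

Hope this helps!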
Ramy Hadid answered Jun 21 '26 18:06


You can try stripping the HTML tags with a regex. For your example, try the following:

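let str = `<ul>
<li>First</li>
<li>Second</li>
</ul>`;

console.log(str); // original HTML

// Matches opening and closing <li> and <ul> tags, with any attributes
const re = /<\/?(li|ul)[^>]*>/g;

// Remove the tags, then trim the line breaks left at both ends
str = str.replace(re, '').trim();
console.log(str);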
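
The pattern above only covers li and ul. A more general variant of the same idea, sketched below, replaces every tag with a space and then collapses runs of whitespace. The `stripTags` name is just an illustrative helper. Treat it as a rough sketch: a regex like this does not decode entities such as `&amp;`, and it can misbehave on comments, script or style content, or attribute values containing `>`.

```javascript
// Illustrative helper (hypothetical name): strip any tag, keep text separated by spaces
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // replace each tag with a space so adjacent text stays separated
    .replace(/\s+/g, ' ')     // collapse runs of spaces and line breaks into one space
    .trim();                  // drop leading and trailing whitespace
}

const result = stripTags(`<ul>
  <li>First</li>
  <li>Second</li>
</ul>`);
console.log(result);
```

Replacing tags with a space rather than an empty string is what keeps First and Second from running together.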
elvira.genkel answered Jun 21 '26 18:06