
Cheerio - Get text with html tags replaced by white spaces

We are using Cheerio, specifically its .text() method, to extract text from HTML input.

But when the HTML is:

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

Visually, on the page, the <div> after the word "By" ensures there is a space or a line break. But when applying Cheerio's .text(), we get a result without that separation:

ByJohn Smith => which is wrong, as we need a white space between "By" and "John".

Generally speaking, is it possible to get the text in a slightly different way, so that ANY HTML tag is replaced by a white space? (I'm OK with collapsing the resulting runs of white space afterwards.)

We'd like to have as output: By John Smith

Mathieu asked Oct 15 '25 18:10



2 Answers

You could use the following regex to replace all HTML tags with a space:

/<\/?[a-zA-Z0-9=" ]*>/g

Replacing the tags in your HTML with this regex may leave several spaces in a row. In that case you can use replace(/\s\s+/g, ' ') to collapse each run of whitespace into a single space.

See the result:

console.log(document.querySelector('div').innerHTML.replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim())
<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>
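The snippet above reads from document, which exists in a browser but not in the Node.js environment where Cheerio usually runs. Here is a minimal sketch of the same idea applied to a plain HTML string. The helper name htmlToSpacedText is hypothetical, and the pattern <[^>]*> is my substitution for the regex above, which only matches tags made of letters, digits, =, " and spaces (a hyphenated class name, for example, would not match).

```javascript
// Hypothetical helper: replace every tag with a space, then normalise whitespace.
// Assumption: the input is an HTML string with no ">" inside attribute values.
function htmlToSpacedText(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // any opening, closing or self-closing tag becomes one space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace, including newlines
    .trim();
}

// The markup from the question.
const html = `<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

console.log(htmlToSpacedText(html)); // By John Smith
```

With Cheerio, the input string would presumably come from something like .html() on the selection rather than .text(). Note that this approach does not decode entities such as &amp;, and it also splits words that contain inline tags (wo<b>rd</b> becomes "wo rd"), which matches the "ANY tag becomes a space" requirement but may not always be what you want.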
Reza Saadati answered Oct 18 '25 12:10



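You can use plain JavaScript in the browser for this task: innerText keeps the line breaks that block elements produce, and those can then be replaced with spaces.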

const parent = document.querySelector('div');
console.log(parent.innerText.replace(/(\r\n|\n|\r)/gm, " "))
<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>
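innerText relies on browser layout, and as far as I know Cheerio does not provide it. A rough sketch of a similar result without a browser is to walk the parsed tree and put a space at every element boundary. The node shape used here ({ type: 'text', data } for text, { type: 'tag', children } for elements) is an assumption modelled on the nodes Cheerio's parser produces, and the names spacedText and cleanText are hypothetical.

```javascript
// Collect text from a node tree, inserting a space at every element boundary.
// Assumes nodes look like { type: 'text', data } or { type: 'tag', children }.
function spacedText(node) {
  if (node.type === 'text') return node.data;
  const children = node.children || []; // comments and other leaves contribute nothing
  return ' ' + children.map(spacedText).join(' ') + ' ';
}

// Collapse the extra whitespace produced by the walk.
function cleanText(node) {
  return spacedText(node).replace(/\s+/g, ' ').trim();
}

// Hand-built tree mirroring the markup from the question.
const tree = {
  type: 'tag', name: 'div', children: [
    { type: 'text', data: '\n  By' },
    { type: 'tag', name: 'div', children: [
      { type: 'tag', name: 'h2', children: [{ type: 'text', data: 'John Smith' }] },
    ] },
    { type: 'text', data: '\n' },
  ],
};

console.log(cleanText(tree)); // By John Smith
```

In a real Cheerio script the starting node would presumably be a raw element taken from a selection (for example via .get(0)) instead of the hand-built tree. Unlike innerText, this sketch treats inline and block elements alike, and it would also include the contents of script or style elements unless those are skipped explicitly.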
Maik Lowrey answered Oct 18 '25 11:10



