Today we're using Cheerio, and notably its .text() method, to extract text from an HTML input.
But when the HTML is
<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>
Visually on the page, the <div> after the word "By" ensures there is a space or a line break. But when applying Cheerio's .text(), we get this result:
ByJohn Smith
=> which is wrong, as we need a white space between "By" and "John".
Generally speaking, is it possible to get the text in a slightly different way, so that ANY HTML tag is replaced by a white space? (I'm OK with collapsing the resulting runs of white space afterwards.)
We'd like to have as output: By John Smith
You could use the following regex to replace all HTML tags with a space:
/<\/?[a-zA-Z0-9=" ]*>/g
Replacing the tags this way may leave several consecutive spaces. In that case you can use replace(/\s\s+/g, ' ')
to collapse each run of whitespace into a single space.
See the result:
console.log(document.querySelector('div').innerHTML.replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim())
<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>
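Since the question is about Cheerio, which usually runs in Node where there is no document object, the same idea can be sketched as a plain function that works on the HTML string. This is only a sketch: the helper name htmlToSpacedText is hypothetical, and it assumes you already have the markup as a string (for example the input you would pass to Cheerio). It uses a broader pattern, <[^>]*>, because the character class above does not match tags whose attributes contain characters such as "-", "/" or ":". Like any regex approach it is not a real HTML parser, so comments, script contents, or a ">" inside an attribute value may not be handled correctly.

```javascript
// Hypothetical helper (not part of Cheerio): replaces every tag with a space,
// then collapses whitespace runs. Assumes reasonably simple, well-formed markup.
function htmlToSpacedText(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // each opening or closing tag becomes one space
    .replace(/\s+/g, ' ')     // collapse spaces, tabs and newlines into one space
    .trim();                  // drop leading and trailing whitespace
}

const html = '<div>\nBy<div><h2 class="authorh2">John Smith</h2></div>\n</div>';
console.log(htmlToSpacedText(html)); // By John Smith
```

HTML entities such as &amp; are left as-is by this sketch, so decode them separately if the input contains any.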
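In a browser, you can use plain JavaScript for this task: innerText reflects the rendered line breaks, which can then be replaced with spaces.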
const parent = document.querySelector('div');
console.log(parent.innerText.replace(/(\r\n|\n|\r)/gm, " "))
<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>