
Convert a string that contains HTML to sentences and keep the separators using JavaScript

Tags:

javascript

This is my string. It contains some HTML:

First sentence. Here is a <a href="http://google.com">Google</a> link in the second sentence! The third sentence might contain an image like this <img src="http://link.to.image.com/hello.png" /> and ends with !? The last sentence looks like <b>this</b>??

I want to split the string into an array of sentences, keeping the HTML as well as the separators. Like this:

[0] = First sentence.
[1] = Here is a <a href="http://google.com">Google</a> link in the second sentence!
[2] = The third sentence might contain an image like this <img src="http://link.to.image.com/hello.png" /> and ends with !?
[3] = The last sentence looks like <b>this</b>??

Can anybody suggest a way to do this, please? Maybe using a regex and match?

This is very close to what I'm after, but it doesn't handle the HTML bits: JavaScript Split Regular Expression keep the delimiter

suprb asked Nov 12 '22 06:11


1 Answer

The easy part is the parsing: assign the string to the innerHTML of a wrapper element and let the browser build the DOM nodes. Splitting the sentences is somewhat more intricate; this is my first stab at it:

var s = 'First sentence. Here is a <a href="http://google.com">Google.</a> link in the second sentence! The third sentence might contain an image like this <img src="http://link.to.image.com/hello.png" /> and ends with !? The last sentence looks like <b>this</b>??';

var wrapper = document.createElement('div');
wrapper.innerHTML = s;

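var sentences = [],
    buffer = [],
    re = /[^.!?]*[.!?]+/g, // optional text, then one or more terminators
    match;

[].forEach.call(wrapper.childNodes, function(node) {
  if (node.nodeType === 1) {
    buffer.push(node.outerHTML); // element node: save its HTML as-is
  } else if (node.nodeType === 3) {
    var str = node.textContent; // text node: shift complete sentences off the front
    while ((match = re.exec(str)) !== null) {
      sentences.push(buffer.join('') + match[0]);
      buffer = [];
      // drop the matched sentence and the whitespace that follows it
      str = str.slice(re.lastIndex).replace(/^\s+/, '');
      re.lastIndex = 0; // reset regexp
    }
    if (str) {
      buffer.push(str); // remainder is an unfinished sentence
    }
  }
});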

if (buffer.length) {
  sentences.push(buffer.join(''));
}

console.log(sentences);


Every node that's either an element or an unfinished sentence gets added to a buffer until the end of a sentence is found; the buffered content is then prepended to that sentence ending, and the complete sentence is pushed onto the result array.
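
If there is no DOM to lean on (under Node.js, for example), roughly the same buffering idea can be sketched with a regex tokenizer in place of innerHTML. This is not the approach above, only an approximation of it: the name splitHtmlSentences and both regular expressions are my own assumptions, and it only copes with flat markup like the example (no element nested inside another of the same name, no stray < in the text, no > inside attribute values).

```javascript
// Hypothetical helper, not part of the original answer: a DOM-free variant of
// the buffering approach, so it also runs outside the browser (e.g. Node.js).
function splitHtmlSentences(html) {
  // Tokens: a paired element (<a ...>...</a>), a lone tag (<img ... />), or a text run.
  // Assumption: flat markup only; an element never contains another of the same name.
  var tokenRe = /<([a-z][a-z0-9]*)\b[^>]*>[\s\S]*?<\/\1>|<[^>]+>|[^<]+/gi;
  // Optional text followed by one or more terminators, anchored at the start.
  var sentenceRe = /^[^.!?]*[.!?]+/;
  var sentences = [];
  var buffer = '';
  var token, match;

  while ((token = tokenRe.exec(html)) !== null) {
    var str = token[0];
    if (str.charAt(0) === '<') {
      buffer += str; // markup is kept verbatim; punctuation inside it is ignored
      continue;
    }
    // Text run: shift complete sentences off the front.
    while ((match = sentenceRe.exec(str)) !== null) {
      sentences.push((buffer + match[0]).trim());
      buffer = '';
      str = str.slice(match[0].length);
    }
    buffer += str; // unfinished sentence, completed by a later token
  }

  if (buffer.trim()) {
    sentences.push(buffer.trim()); // trailing text without a terminator
  }
  return sentences;
}

var html = 'First sentence. Here is a <a href="http://google.com">Google.</a> link in the second sentence! The third sentence might contain an image like this <img src="http://link.to.image.com/hello.png" /> and ends with !? The last sentence looks like <b>this</b>??';

console.log(splitHtmlSentences(html));
```

For the example string this returns four entries, and the period inside the link text does not cause a split because the whole element is buffered as one token. Unlike outerHTML, the tags come back exactly as written in the input (the img keeps its " />").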

Ja͢ck answered Nov 14 '22 22:11