I have a challenging problem to solve. I'm working on a script which takes a regex as input. The script then finds all matches for this regex in a document and wraps each match in its own <span> element. The hard part is that the text is a formatted HTML document, so my script needs to navigate through the DOM and apply the regex across multiple text nodes at once, while figuring out where it has to split text nodes if needed.
For example, with a regex that captures full sentences starting with a capital letter and ending with a period, this document:
<p> <b>HTML</b> is a language used to make <b>websites.</b> It was developed by <i>CERN</i> employees in the early 90s. </p>
Would be turned into this:
<p> <span><b>HTML</b> is a language used to make <b>websites.</b></span> <span>It was developed by <i>CERN</i> employees in the early 90s.</span> </p>
The script then returns the list of all created spans.
I already have some code which finds all the text nodes and stores them in a list along with their position across the whole document and their depth. You don't really need to understand that code to help me and its recursive structure can be a bit confusing. The first part I'm not sure how to do is figure out which elements should be included within the span.
function SmartNode(node, depth, start) {
    this.node = node;
    this.depth = depth;
    this.start = start;
}

function findTextNodes(node, depth, start) {
    var list = [];
    start = start || 0;
    depth = (typeof depth !== "undefined" ? depth : -1);

    if (node.nodeType === Node.TEXT_NODE) {
        list.push(new SmartNode(node, depth, start));
    } else {
        for (var i = 0; i < node.childNodes.length; ++i) {
            var found = findTextNodes(node.childNodes[i], depth + 1, start);
            if (found.length) {
                // Continue right after the last text node found in this subtree
                var last = found[found.length - 1];
                start = last.start + last.node.nodeValue.length;
                list = list.concat(found);
            }
        }
    }
    return list;
}
I figure I'll make a string out of the whole document, run the regex through it, and use the list to find which nodes correspond to which regex matches, then split the text nodes accordingly.
But an issue arises when I have a document like this:
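The mapping step of that plan can be sketched without touching the DOM. The snippet below is only an illustration: it assumes the text nodes were already collected into { start, length } records similar to the SmartNode list, and locateMatch is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: map a match [matchStart, matchEnd) in the concatenated
// text onto the text nodes it touches. Each entry of `nodes` is assumed to be
// { start, length }, where `start` is the node's offset in the whole text.
function locateMatch(nodes, matchStart, matchEnd) {
    var segments = [];
    for (var i = 0; i < nodes.length; ++i) {
        var nodeStart = nodes[i].start;
        var nodeEnd = nodeStart + nodes[i].length;
        if (nodeEnd <= matchStart) continue; // node ends before the match
        if (nodeStart >= matchEnd) break;    // node starts after the match
        segments.push({
            nodeIndex: i,
            // Offsets are relative to the start of the text node itself
            from: Math.max(matchStart, nodeStart) - nodeStart,
            to: Math.min(matchEnd, nodeEnd) - nodeStart
        });
    }
    return segments;
}

// Text node values of: <b>HTML</b> is a language used to make <b>websites.</b>
var values = ["HTML", " is a language used to make ", "websites."];
var nodes = [], offset = 0;
values.forEach(function (value) {
    nodes.push({ start: offset, length: value.length });
    offset += value.length;
});

var match = /[A-Z].*?\./.exec(values.join(""));
var segments = locateMatch(nodes, match.index, match.index + match[0].length);
```

Any segment whose from is greater than 0, or whose to is smaller than the node length, marks a text node that would have to be split.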
<p> This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a> </p>
There's a sentence which starts outside of the <a> tag but ends inside it. Now I don't want the script to split that link into two tags. In a more complex document, it could ruin the page if it did. The code could either wrap two sentences together:
<p> <span>This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a></span> </p>
Or just wrap each part in its own element:
<p> <span>This program is </span> <a href="beta.html"> <span>not stable yet.</span> <span>Do not use this in production yet.</span> </a> </p>
There could be a parameter to specify what it should do. I'm just not sure how to figure out when an impossible cut is about to happen, and how to recover from it.
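One possible way to spot an impossible cut, sketched here under assumptions: if every element is described by the range of text it covers, a match can be wrapped in a single span only when no element overlaps it partially. The { tag, start, end } records and the names findConflicts and splitAtConflicts are hypothetical; the records would have to be built while walking the DOM.

```javascript
// Each element is assumed to be described by the half-open text range
// [start, end) it covers in the concatenated document text.
// An element conflicts with a match when the two ranges overlap but neither
// one contains the other: wrapping the match would cut that element in two.
function findConflicts(elements, matchStart, matchEnd) {
    return elements.filter(function (el) {
        var overlaps = el.start < matchEnd && matchStart < el.end;
        var elementInsideMatch = matchStart <= el.start && el.end <= matchEnd;
        var matchInsideElement = el.start <= matchStart && matchEnd <= el.end;
        return overlaps && !elementInsideMatch && !matchInsideElement;
    });
}

// Recovery option "wrap each part on its own": cut the match at every
// conflicting element boundary that falls strictly inside it.
function splitAtConflicts(elements, matchStart, matchEnd) {
    var cuts = [matchStart, matchEnd];
    findConflicts(elements, matchStart, matchEnd).forEach(function (el) {
        if (el.start > matchStart && el.start < matchEnd) cuts.push(el.start);
        if (el.end > matchStart && el.end < matchEnd) cuts.push(el.end);
    });
    cuts.sort(function (a, b) { return a - b; });
    var pieces = [];
    for (var i = 0; i + 1 < cuts.length; ++i) {
        if (cuts[i] < cuts[i + 1]) pieces.push([cuts[i], cuts[i + 1]]);
    }
    return pieces;
}

// Text of: <p> This program is <a href="beta.html">not stable yet. Do not use
// this in production yet.</a> </p>
var before = " This program is ";
var link = "not stable yet. Do not use this in production yet.";
var text = before + link + " ";
var elements = [
    { tag: "p", start: 0, end: text.length },
    { tag: "a", start: before.length, end: before.length + link.length }
];

var match = /[A-Z].*?\./.exec(text);
var end = match.index + match[0].length;
var conflicts = findConflicts(elements, match.index, end); // only the <a>
var pieces = splitAtConflicts(elements, match.index, end); // two sub-ranges
```

The other recovery option, wrapping the sentences together, would instead grow the match until it fully contains every conflicting element, then re-run the check on the enlarged range.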
Another issue comes when I have whitespace inside a child element like this:
<p>This is a <b>sentence. </b></p>
Technically, the regex match would end right after the period, before the end of the <b> tag. However, it would be much better to consider the space as part of the match and wrap it like this:
<p><span>This is a <b>sentence. </b></span></p>
Rather than this:
<p><span>This is a </span><b><span>sentence.</span> </b></p>
But that's a minor issue. After all, I could just allow extra whitespace to be included in the regex match.
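As a small illustration of that idea, the pattern can simply absorb trailing whitespace. The sentence regex below is an assumption (the exact pattern is not given above); only the appended \s* matters.

```javascript
// Text content of: <p>This is a <b>sentence. </b></p>
var text = "This is a " + "sentence. ";

// Without trailing whitespace, the match stops before the final space...
var strict = /[A-Z].*?\./.exec(text)[0];

// ...while appending \s* lets the match run to the end of the <b> element.
var relaxed = /[A-Z].*?\.\s*/.exec(text)[0];
```

The trade-off is that the whitespace between two sentences also ends up inside the first sentence's span.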
I know this might sound like a "do it for me" question and it's not the kind of quick question we see on SO on a daily basis, but I've been stuck on this for a while and it's for an open-source library I'm working on. Solving this problem is the last obstacle. If you think another SE site is better suited for this question, please redirect me.
Here are two ways to deal with this.
I don't know if the following will exactly match your needs. It's a simple enough solution to the problem, but at least it doesn't use RegEx to manipulate HTML tags. It performs pattern matching against the raw text and then uses the DOM to manipulate the content.
This approach creates only one <span> tag per match, leveraging some less common browser APIs.
(See the main problem of this approach below the demo, and if not sure, use the second approach).
The Range class represents a text fragment. It has a surroundContents function that lets you wrap a range in an element. Except it has a caveat:
This method is nearly equivalent to newNode.appendChild(range.extractContents()); range.insertNode(newNode). After surrounding, the boundary points of the range include newNode.
An exception will be thrown, however, if the Range splits a non-Text node with only one of its boundary points. That is, unlike the alternative above, if there are partially selected nodes, they will not be cloned and instead the operation will fail.
Well, the workaround is provided by MDN itself, so all's good.
So here's an algorithm:
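1. Collect the list of Text nodes and keep their start indices in the text
2. Concatenate the values of these nodes to get the text
3. Find matches over the text, and for each match:
   a. Find the first and last nodes of the match
   b. Create a Range over the match
   c. Let the browser do the dirty work using the trick above
   d. Rebuild the node list, since the previous step changed the DOM
Here's my implementation with a demo: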
function highlight(element, regex) {
    var document = element.ownerDocument;

    // Collect every text node under the element with its offset in the full text
    var getNodes = function() {
        var nodes = [],
            offset = 0,
            node,
            nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);

        while (node = nodeIterator.nextNode()) {
            nodes.push({
                textNode: node,
                start: offset,
                length: node.nodeValue.length
            });
            offset += node.nodeValue.length;
        }
        return nodes;
    };

    var nodes = getNodes();
    if (!nodes.length)
        return;

    var text = "";
    for (var i = 0; i < nodes.length; ++i)
        text += nodes[i].textNode.nodeValue;

    var match;
    while (match = regex.exec(text)) {
        // Prevent empty matches causing infinite loops
        if (!match[0].length) {
            regex.lastIndex++;
            continue;
        }

        // Find the start and end text node
        var startNode = null, endNode = null;
        for (i = 0; i < nodes.length; ++i) {
            var node = nodes[i];
            if (node.start + node.length <= match.index)
                continue;
            if (!startNode)
                startNode = node;
            if (node.start + node.length >= match.index + match[0].length) {
                endNode = node;
                break;
            }
        }

        var range = document.createRange();
        range.setStart(startNode.textNode, match.index - startNode.start);
        range.setEnd(endNode.textNode, match.index + match[0].length - endNode.start);

        var spanNode = document.createElement("span");
        spanNode.className = "highlight";
        spanNode.appendChild(range.extractContents());
        range.insertNode(spanNode);

        // The DOM changed, so the node list has to be rebuilt
        nodes = getNodes();
    }
}

// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;

function test() {
    testDiv.innerHTML = originalHtml;
    try {
        var regex = new RegExp(document.getElementById("regex").value, "g");
        highlight(testDiv, regex);
    } catch(e) {
        testDiv.innerText = e;
    }
}

document.getElementById("runBtn").onclick = test;
test();
.highlight {
    background-color: yellow;
    border: 1px solid orange;
    border-radius: 5px;
}

.section {
    border: 1px solid gray;
    padding: 10px;
    margin: 10px;
}
<form class="section">
  RegEx: <input id="regex" type="text" value="[A-Z].*?\." />
  <button id="runBtn">Highlight</button>
</form>

<div id="test-cases" class="section">
  <div>foo bar baz</div>
  <p>
    <b>HTML</b> is a language used to make <b>websites.</b>
    It was developed by <i>CERN</i> employees in the early 90s.
  </p>
  <p>
    This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
  </p>
  <div>foo bar baz</div>
</div>
OK, that was the lazy approach, which unfortunately doesn't work in some cases. It works well if you only highlight across inline elements, but it breaks when there are block elements along the way, because of the following property of the extractContents function:
Partially selected nodes are cloned to include the parent tags necessary to make the document fragment valid.
That's bad. It'll just duplicate block-level nodes. Try the previous demo with the baz\s+HTML regex if you want to see how it breaks.
This approach iterates over the matching nodes, creating <span> tags along the way.
The overall algorithm is straightforward, as it just wraps each matching node in its own <span>. But this means we have to deal with partially matching text nodes, which requires some more effort.
If a text node matches partially, it's split with the splitText function:
After the split, the current node contains all the content up to the specified offset point, and a newly created node of the same type contains the remaining text. The newly created node is returned to the caller.
function highlight(element, regex) {
    var document = element.ownerDocument;

    var nodes = [],
        text = "",
        node,
        nodeIterator = document.createNodeIterator(element, NodeFilter.SHOW_TEXT, null, false);

    while (node = nodeIterator.nextNode()) {
        nodes.push({
            textNode: node,
            start: text.length
        });
        text += node.nodeValue;
    }

    if (!nodes.length)
        return;

    var match;
    while (match = regex.exec(text)) {
        var matchLength = match[0].length;

        // Prevent empty matches causing infinite loops
        if (!matchLength) {
            regex.lastIndex++;
            continue;
        }

        for (var i = 0; i < nodes.length; ++i) {
            node = nodes[i];
            var nodeLength = node.textNode.nodeValue.length;

            // Skip nodes before the match
            if (node.start + nodeLength <= match.index)
                continue;

            // Break after the match
            if (node.start >= match.index + matchLength)
                break;

            // Split the start node if required
            if (node.start < match.index) {
                nodes.splice(i + 1, 0, {
                    textNode: node.textNode.splitText(match.index - node.start),
                    start: match.index
                });
                continue;
            }

            // Split the end node if required
            if (node.start + nodeLength > match.index + matchLength) {
                nodes.splice(i + 1, 0, {
                    textNode: node.textNode.splitText(match.index + matchLength - node.start),
                    start: match.index + matchLength
                });
            }

            // Highlight the current node
            var spanNode = document.createElement("span");
            spanNode.className = "highlight";

            node.textNode.parentNode.replaceChild(spanNode, node.textNode);
            spanNode.appendChild(node.textNode);
        }
    }
}

// Test code
var testDiv = document.getElementById("test-cases");
var originalHtml = testDiv.innerHTML;

function test() {
    testDiv.innerHTML = originalHtml;
    try {
        var regex = new RegExp(document.getElementById("regex").value, "g");
        highlight(testDiv, regex);
    } catch(e) {
        testDiv.innerText = e;
    }
}

document.getElementById("runBtn").onclick = test;
test();
.highlight {
    background-color: yellow;
}

.section {
    border: 1px solid gray;
    padding: 10px;
    margin: 10px;
}
<form class="section">
  RegEx: <input id="regex" type="text" value="[A-Z].*?\." />
  <button id="runBtn">Highlight</button>
</form>

<div id="test-cases" class="section">
  <div>foo bar baz</div>
  <p>
    <b>HTML</b> is a language used to make <b>websites.</b>
    It was developed by <i>CERN</i> employees in the early 90s.
  </p>
  <p>
    This program is <a href="beta.html">not stable yet. Do not use this in production yet.</a>
  </p>
  <div>foo bar baz</div>
</div>
This should be good enough for most cases, I hope. If you need to minimize the number of <span> tags, it can be done by extending this function, but I wanted to keep it simple for now.