Best algorithm to highlight a list of given words in an HTML file

Question

I have some HTML files over which I have no control, so I can't change their structure or markup.

For each of these HTML files, a list of words is produced by another algorithm. These words should be highlighted in the text of the HTML. For example, if the HTML markup is:

<p>
Monkeys are going to die soon, if we don't stop killing them. 
So, we have to try hard to persuade hunters not to hunt monkeys. 
Monkeys are very intelligent, and they should survive. 
In fact, they deserve to survive.
</p>

and the list of words is:

are, we, monkey

the result should be something like:

<p>
    <span class='highlight'>Monkeys</span> 
    <span class='highlight'>are</span> 
going to die soon, if 
    <span class='highlight'>we</span> 
don't stop killing them. 
So, 
    <span class='highlight'>we</span> 
have to try hard to persuade hunters 
not to hunt 
    <span class='highlight'>monkeys</span>
. 
    <span class='highlight'>Monkeys</span> 
    <span class='highlight'>are</span> 
very intelligent, and they should survive. 
In fact, they deserve to survive.
</p>

The highlighting algorithm should:

be case-insensitive
be written in JavaScript (this happens inside the browser; jQuery is welcome)
be fast (it must handle the text of a book of almost 800 pages)
not trigger the browser's famous "stop script" dialog
work on dirty HTML files (for example invalid markup such as unclosed elements; some of these files are HTML exports from MS Word, so you can guess what I mean by dirty)
preserve the original HTML markup (no markup deletion, no markup change except wrapping the intended words inside an element, and no nesting change; the HTML should look the same before and after the edit, except for the highlighted words)
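
For the speed and "stop script" requirements, one common approach is to split the work into small batches and yield to the browser between them. The following is only a sketch under that assumption: processInChunks, handleItem, chunkSize, onDone and schedule are hypothetical names, and the right batch size would have to be found by measurement.

```javascript
// Process `items` in batches and hand control back to the browser between
// batches, so that no single script run lasts long enough to trigger the
// "stop script" dialog.
// `schedule` is a hypothetical hook; it defaults to setTimeout(fn, 0).
function processInChunks(items, handleItem, chunkSize, onDone, schedule) {
  const defer = schedule || function (fn) { setTimeout(fn, 0); };
  const size = Math.max(1, chunkSize | 0); // guard against 0 or negative sizes
  let index = 0;

  function runChunk() {
    const end = Math.min(index + size, items.length);
    for (; index < end; index++) {
      handleItem(items[index], index);
    }
    if (index < items.length) {
      defer(runChunk); // more work left: continue in a later task
    } else if (onDone) {
      onDone(); // everything has been processed
    }
  }

  runChunk();
}
```

Here items could be the collected text nodes and handleItem the function that highlights one node; each batch then runs for only a short time.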

What I've done so far:

I get the list of words in JavaScript in an array like ["are", "we", "monkey"]
I try to select the text nodes in the browser (this part is currently faulty)
I loop over each text node, and for each text node I loop over each word in the list, try to find it, and wrap it inside an element
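
The steps above can be sketched roughly as follows. This is not the script from the page. It is a minimal illustration that assumes a single global regex replaces the inner per-word loop, and that a TreeWalker is used to collect the text nodes. splitByMatches and highlightElement are hypothetical names.

```javascript
// Split a string into alternating plain and matched segments.
// Assumption: `regex` carries the "g" flag, otherwise exec() would never advance.
function splitByMatches(text, regex) {
  if (!regex.global) throw new Error("regex must have the g flag");
  const segments = [];
  let last = 0;
  let m;
  regex.lastIndex = 0;
  while ((m = regex.exec(text)) !== null) {
    if (m[0].length === 0) { regex.lastIndex++; continue; } // skip empty matches
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), hit: false });
    }
    segments.push({ text: m[0], hit: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), hit: false });
  }
  return segments;
}

// Browser-only part: collect all text nodes first, then replace each node
// that contains a match with a fragment of text nodes and highlight spans.
function highlightElement(root, regex) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const nodes = [];
  while (walker.nextNode()) nodes.push(walker.currentNode);

  for (const node of nodes) {
    const parentTag = node.parentNode.nodeName;
    if (parentTag === "SCRIPT" || parentTag === "STYLE") continue;

    const segments = splitByMatches(node.nodeValue, regex);
    if (!segments.some(function (s) { return s.hit; })) continue;

    const fragment = document.createDocumentFragment();
    for (const s of segments) {
      if (s.hit) {
        const span = document.createElement("span");
        span.className = "highlight";
        span.textContent = s.text;
        fragment.appendChild(span);
      } else {
        fragment.appendChild(document.createTextNode(s.text));
      }
    }
    node.parentNode.replaceChild(fragment, node);
  }
}
```

Because only text nodes are touched, the existing elements and their nesting stay as the browser parsed them, which also sidesteps parsing the dirty markup by hand. A call might look like highlightElement(document.body, /\b(?:are|we|monkeys?)\b/gi).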

You can see it online here (username: demo@phis.ir, pass: demo). The current script can be found at the end of the page's source.

Andrew Tomazos · Accepted Answer

Concatenate your words with | into a single string, interpret that string as a regex, and then replace each occurrence with the full match wrapped in the highlight tags.
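
A minimal sketch of that idea, with some assumptions added that are not part of the answer: the words are escaped before being joined, \b word boundaries keep "we" from matching inside "were", and an optional trailing "s" lets "monkey" match "Monkeys". escapeRegExp, buildWordRegex and highlightText are hypothetical helper names. The function is meant for plain text (for example the content of one text node), because running it over raw HTML would also match inside tags and attributes.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one case-insensitive, global regex from the whole word list.
// Assumption: an optional trailing "s" is accepted as a simple plural.
function buildWordRegex(words) {
  const alternation = words.map(escapeRegExp).join("|");
  return new RegExp("\\b(?:" + alternation + ")s?\\b", "gi");
}

// Wrap every match in a highlight span; "$&" inserts the full match,
// so the original capitalisation is preserved.
function highlightText(text, words) {
  if (words.length === 0) return text; // nothing to highlight
  return text.replace(buildWordRegex(words), "<span class='highlight'>$&</span>");
}
```

Note that \b in JavaScript is defined in terms of ASCII word characters, so word lists in other scripts would need a different boundary test.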

Amberlamps · Answer

The following regular expression works for your example. Maybe you can pick it up from there:

"Monkeys are going to die soon, if we don't stop killing them. So, we have to try hard to persuade hunters not to hunt monkeys. Monkeys are very intelligent, and they should survive. In fact, they deserve to survive.".replace(/\b(we|are|monkeys?)\b/gi, "<span class='highlight'>$1</span>")

Tags: javascript, html, jquery, algorithm

Saeed Neamati
