
Best algorithm to highlight a list of given words in an HTML file

I have some HTML files over which I have no control, so I can't change their structure or markup.

For each of these HTML files, a list of words is produced by another algorithm. These words should be highlighted in the text of the HTML. For example, if the HTML markup is:

<p>
Monkeys are going to die soon, if we don't stop killing them. 
So, we have to try hard to persuade hunters not to hunt monkeys. 
Monkeys are very intelligent, and they should survive. 
In fact, they deserve to survive.
</p>

and the list of the words is:

are, we, monkey

the result should be something like:

<p>
    <span class='highlight'>Monkeys</span> 
    <span class='highlight'>are</span> 
going to die soon, if 
    <span class='highlight'>we</span> 
don't stop killing them. 
So, 
    <span class='highlight'>we</span> 
have to try hard to persuade hunters 
not to hunt 
    <span class='highlight'>monkeys</span>
. 
    <span class='highlight'>Monkeys</span> 
    <span class='highlight'>are</span> 
very intelligent, and they should survive. 
In fact, they deserve to survive.
</p>

The highlighting algorithm should:

  1. be case-insensitive
  2. be written in JavaScript (this runs inside the browser; jQuery is welcome)
  3. be fast (it must cope with the text of a book of almost 800 pages)
  4. not trigger the browser's famous "stop script" dialog
  5. work on dirty HTML files (for example invalid markup such as unclosed elements; some of these files are HTML exports from MS Word, so you can guess what I mean by dirty)
  6. preserve the original HTML markup (no markup deleted, no markup changed except for wrapping the intended words in an element, no change in nesting; apart from the highlighted words, the HTML should look the same before and after the edit)

What I've done so far:

  1. I get the list of words in JavaScript as an array like ["are", "we", "monkey"]
  2. I try to select the text nodes in the browser (this part is currently faulty)
  3. I loop over the text nodes and, for each text node, loop over each word in the list, try to find it, and wrap it in an element
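The three steps above could be sketched roughly as follows. This is only a sketch under assumptions, not the script from the page: it assumes a browser that supports `document.createTreeWalker`, and the names `splitByRegex`, `highlightTextNode` and `highlightAll` are made up for illustration. Walking text nodes instead of rewriting `innerHTML` leaves the existing markup untouched, and working in batches with `setTimeout` is one possible way to avoid the "stop script" dialog on long documents.

```javascript
// Split a string into pieces; pieces with hit === true matched the regex.
// The regex must carry the global flag, otherwise exec() never advances.
function splitByRegex(text, regex) {
  if (!regex.global) throw new Error("regex needs the g flag");
  const pieces = [];
  let last = 0;
  let m;
  regex.lastIndex = 0;
  while ((m = regex.exec(text)) !== null) {
    if (m[0] === "") { regex.lastIndex++; continue; } // skip empty matches
    if (m.index > last) pieces.push({ text: text.slice(last, m.index), hit: false });
    pieces.push({ text: m[0], hit: true });
    last = m.index + m[0].length;
  }
  if (last < text.length) pieces.push({ text: text.slice(last), hit: false });
  return pieces;
}

// Replace one text node with plain text nodes plus <span class="highlight"> wrappers.
function highlightTextNode(node, regex) {
  const pieces = splitByRegex(node.nodeValue, regex);
  if (!pieces.some(p => p.hit)) return; // no match: leave the node alone
  const frag = document.createDocumentFragment();
  for (const p of pieces) {
    if (p.hit) {
      const span = document.createElement("span");
      span.className = "highlight";
      span.textContent = p.text;
      frag.appendChild(span);
    } else {
      frag.appendChild(document.createTextNode(p.text));
    }
  }
  node.parentNode.replaceChild(frag, node);
}

// Collect the text nodes first, then process them in batches so the
// browser gets a chance to breathe between batches.
function highlightAll(root, regex, batchSize = 500) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const nodes = [];
  while (walker.nextNode()) {
    const tag = walker.currentNode.parentNode.nodeName;
    if (tag !== "SCRIPT" && tag !== "STYLE") nodes.push(walker.currentNode);
  }
  let i = 0;
  (function step() {
    const end = Math.min(i + batchSize, nodes.length);
    for (; i < end; i++) highlightTextNode(nodes[i], regex);
    if (i < nodes.length) setTimeout(step, 0); // yield to the browser
  })();
}
```

In a page this might be called as `highlightAll(document.body, /\b(?:are|we|monkeys?)\b/gi)`. The batch size of 500 is an arbitrary guess and would need tuning against a real 800-page document.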

Please note that you can watch it online here (username: [email protected], pass: demo). The current script can also be seen at the end of the page's source.

Saeed Neamati asked Nov 05 '12 12:11


2 Answers

Concatenate your words with | into a single string, interpret that string as a regex, and then replace each occurrence with the full match wrapped in the highlight tags.
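A minimal sketch of that idea, assuming the text is available as a plain string and the word list is not empty. The names `escapeRegExp`, `buildWordRegex` and `highlightString` are illustrative, and the optional trailing `s` (so that `monkey` also matches `Monkeys`) is an assumption taken from the expected output in the question.

```javascript
// Escape characters that have a special meaning inside a regex,
// so a word like "c++" is matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Join the words with | and anchor the alternation on word boundaries.
// Assumption: a plural "s" after any word should be highlighted too.
function buildWordRegex(words) {
  const alternation = words.map(escapeRegExp).join("|");
  return new RegExp("\\b(?:" + alternation + ")s?\\b", "gi");
}

// Wrap every match in the highlight span; $& stands for the full match.
function highlightString(text, words) {
  return text.replace(buildWordRegex(words), "<span class='highlight'>$&</span>");
}
```

For example, `highlightString("So, we hunt monkeys.", ["are", "we", "monkey"])` wraps `we` and `monkeys`. Because this works on a string, running it over raw HTML source would also touch tag or attribute names that happen to match, so it is probably safer to apply it to the text content only. Note also that `\b` is defined in terms of ASCII word characters, so words with non-ASCII letters may need a different boundary test.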

Andrew Tomazos answered Oct 26 '22 08:10


The following regular expression works for your example. Maybe you can pick it up from there:

"Monkeys are going to die soon, if we don't stop killing them. So, we have to try hard to persuade hunters not to hunt monkeys. Monkeys are very intelligent, and they should survive. In fact, they deserve to survive.".replace(/\b(we|are|monkeys?)\b/gi, "<span class='highlight'>$1</span>")

Amberlamps answered Oct 26 '22 09:10