How to match a keyword on a web page that is NOT within an <a> and its href, using JavaScript?

I'm searching a page to find a specific keyword. That itself is easy enough. The added complication is that I don't want to match this keyword if it is part of an <a> tag.

E.g.

<p>Here is some example content that has a keyword in it. 
I want to match this keyword here but, i don't want to match 
the <a href="http://www.keyword.com">keyword</a> here.</p>

In the example content above, the word 'keyword' appears 4 times. I want to match the first two occurrences, which are plain paragraph text, but I do not want to match it when it appears as part of the href or as part of the <a> content.

So far I've come up with this:

var tester = new RegExp("((?!<a.*?>)("+keyword+")(?!</a>))", 'ig');

The problem is that it still matches the keyword when it is part of the href.
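To make the failure concrete, here is a small sketch that runs the expression from the question against the sample paragraph. The variable names html and hits are only for illustration, and the sample's line breaks are collapsed to spaces.

```javascript
// Sample content from the question (line breaks collapsed to spaces).
const html =
  '<p>Here is some example content that has a keyword in it. ' +
  "I want to match this keyword here but, i don't want to match " +
  'the <a href="http://www.keyword.com">keyword</a> here.</p>';

const keyword = 'keyword';
const tester = new RegExp('((?!<a.*?>)(' + keyword + ')(?!</a>))', 'ig');

// Collect the start index of every match.
const hits = [];
let m;
while ((m = tester.exec(html)) !== null) {
  hits.push(m.index);
}

// The first lookahead is tested at the position where "keyword" begins, so it
// can never see "<a" there and never rejects anything. Only the link text,
// which is directly followed by "</a>", is excluded; the href slips through.
console.log(hits.length); // 3
console.log(html.slice(hits[2] - 4, hits[2] + keyword.length)); // www.keyword
```

So the third match is the one inside the href, which is exactly the unwanted case.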

Any ideas? Thanks

user589080 asked Jan 25 '11 14:01


1 Answer

You can't reliably do this with JavaScript regexes. It's hard enough to do with the .NET regex engine, which is one of the few to support infinite-length lookbehind assertions, but JavaScript engines older than ES2018 don't know lookbehind assertions at all, so there you can't look back to see what came before the text you do want to match.

So you should either use a DOM parser (I'm sure someone fluent in JavaScript can suggest a practical approach here), or read the text, remove all the <a> tags (which you sort of could do with a regex, if you're the brave type), and then search for your keyword in the rest of the text.
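For the DOM route, one possible sketch (not necessarily the most practical one) is to walk the tree yourself and never descend into <a> elements, so neither the link text nor the href is ever looked at. The function name countOutsideLinks is made up for this example, and it only relies on the standard node properties nodeType, nodeName, nodeValue and childNodes.

```javascript
// Count occurrences of `keyword` in the text nodes below `root`, skipping
// every <a> element together with everything inside it.
// countOutsideLinks is a hypothetical helper name, not a built-in.
function countOutsideLinks(root, keyword) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(escaped, 'gi');
  let count = 0;

  function walk(node) {
    if (node.nodeType === 3) {                  // 3 = text node
      const found = node.nodeValue.match(pattern);
      if (found) count += found.length;
      return;
    }
    if (node.nodeType === 1 && node.nodeName.toUpperCase() === 'A') {
      return;                                   // do not descend into links
    }
    for (const child of Array.from(node.childNodes || [])) {
      walk(child);
    }
  }

  walk(root);
  return count;
}
```

In a browser you would call it as countOutsideLinks(document.body, 'keyword'). Note that this sketch still counts text inside script and style elements; skip those the same way as <a> if that matters to you.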

EDIT:

Well, there is a dirty hack that you could use. It's not pretty, and if you look at Alan Moore's comment to your question, you'll be able to imagine a multitude of ways in which this regex will fail, but it does work on your example:

/keyword(?!(?:(?!<a).)*<\/a)/

How does it "work"?

keyword    # Match "keyword"
(?!        # but only if it is not possible to match the following regex in the text ahead:
 (?:       # - Match...
  (?!<a)   # -- unless it's the start of an <a> tag...
  .        # -- any character
 )*        # - any number of times
 <\/a      # - then match a closing </a tag (the slash is escaped for the JavaScript literal).
)          # End of lookahead assertion.

This is quite cryptic, even with the explanation. What it essentially does is:

  • Match "keyword"
  • Look ahead to check that no closing </a> follows in the text,
  • unless an opening <a> tag comes first.

So if all your <a> tags are correctly balanced, not nested, not found inside comments or script blocks, you might just get away with it.
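Here is a quick sketch of the hack in action, with two inputs (made up for this demo) where it breaks. The names outsideLink and countMatches are only for illustration.

```javascript
// The hack as a JavaScript literal ("/" has to be written "\/"), with the
// global flag added so that every occurrence is counted.
const outsideLink = /keyword(?!(?:(?!<a).)*<\/a)/g;

// Helper for this demo only: number of matches in a string.
const countMatches = (text) => (text.match(outsideLink) || []).length;

// The question's sample, line breaks collapsed to spaces.
const sample =
  '<p>Here is some example content that has a keyword in it. ' +
  "I want to match this keyword here but, i don't want to match " +
  'the <a href="http://www.keyword.com">keyword</a> here.</p>';

const onSample = countMatches(sample);

// "." does not match a line break, so the lookahead never reaches "</a"
// and the link text is matched although it should not be.
const linkAcrossLines = countMatches('<a href="#">keyword\nmore text</a>');

// "</a" is also the beginning of "</abbr>", so this plain occurrence is lost.
const insideAbbr = countMatches('<p><abbr>keyword</abbr></p>');

console.log(onSample, linkAcrossLines, insideAbbr); // 2 1 0
```

The sample gives the wanted two matches, while the other two inputs show a false positive and a false negative.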

Tim Pietzcker answered Sep 25 '22 18:09