
URL replace with anchor, not replacing existing anchors

I'm building code that matches and replaces several types of patterns (bbCode). One of the patterns I'm trying to match is [url=http:example.com], which should be replaced with an anchor link. I'm also trying to match plain-text URLs and replace them with anchor links. The combination of these two is where I'm running into trouble.

Since my routine is recursive, matching and replacing the entire text on each run, I'm having trouble NOT replacing URLs that are already contained in anchors.

This is the recursive routine I'm running:

if(text.search(p.pattern) !== -1) {
    text = text.replace(p.pattern, p.replace);
}

This is my regexp for plain urls so far:

/(?!href="|>)(ht|f)tps?:\/\/.*?(?=\s|$)/ig

URLs can start with http, https, ftp, or ftps and contain any text afterwards, ending at whitespace or a punctuation mark (., !, ? or ,).
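As a quick check of why the pattern above still hits URLs inside anchors, here is a minimal sketch that runs it against one of the "should not match" samples. The variable names are only illustrative. The leading (?!href="|>) is a lookahead, so it inspects the text that begins at the candidate URL itself (which starts with "http"), not the text in front of it.

```javascript
// The plain-URL pattern from the question, unchanged.
const questionPattern = /(?!href="|>)(ht|f)tps?:\/\/.*?(?=\s|$)/ig;

// One of the "should not match" samples from the question.
const anchorSample = '<a href="http://www.example.com">http://www.example.com </a>';

// The lookahead is evaluated at the position where "http" starts, so it can
// never see the href=" or > that precedes the URL.
const found = anchorSample.match(questionPattern);

// Prints the matches; the URL inside the href attribute is picked up.
console.log(found);
```

The match starts at the URL inside the href attribute and runs up to the first whitespace, which is exactly the behaviour described as unwanted above.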

Just to be absolutely clear, I'm using this as a test for matches:

Should match:

  • http://www.example.com
  • http://www.example.com/test
  • http://example.com/test
  • www.example.com/test

Should not match:

  • <a href="http://www.example.com">http://www.example.com </a>
  • <a href="http://www.example.com/test">http://www.example.com/test </a>
  • <a href="http://example.com/test">http://example.com/test </a>
  • <a href="www.example.com/test">www.example.com/test </a>

I would really appreciate any help I can get here.

EDIT The first accepted solution by jkshah below does have some flaws. For instance, it will match

<img src="http://www.example.com/test.jpg">

The comments on Jerry's solution, however, made me want to try it again, and it solved this issue as well. I have therefore accepted that solution instead. Thank you all for your kind help on this. :)
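The difference between the two solutions on the <img> case can be checked with a small sketch. Both patterns are copied from the answers below; the variable names are just labels chosen for this example.

```javascript
// Pattern from jkshah's answer (the first accepted solution).
const firstAccepted = /((?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm;

// Pattern from Jerry's answer (the solution accepted afterwards).
const laterAccepted = /(?:(?:ht|f)tps?:\/\/|www)[^<>\]]+?(?![^<>\]]*([>]|<\/))(?=[\s!,?\]]|$)/gm;

const imgSample = '<img src="http://www.example.com/test.jpg">';

// String.prototype.match with a global regex returns an array of all
// matches, or null when there are none.
const firstMatches = imgSample.match(firstAccepted);
const laterMatches = imgSample.match(laterAccepted);

console.log(firstMatches); // an array: the src URL is matched
console.log(laterMatches); // null: nothing inside the tag is matched
```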

Øystein Amundsen asked Sep 14 '25 07:09



2 Answers

Maybe something like this?

/(?:(?:ht|f)tps?:\/\/|www)[^<>\]]+?(?![^<>\]]*([>]|<\/))(?=[\s!,?\]]|$)/gm

Then trim any dots at the end of the match.

regex101 demo

If the link is followed by more punctuation, though, this might cause some issues. In that case I would suggest capturing the link first and then removing the trailing punctuation with a second replace.

[^<>\]]+ will match any character except <, > and ].

(?![^<>\]]*([>]|<\/)) prevents a link between HTML tags from being matched.

(?=[\s!,?\]]|$) requires the link to be followed by whitespace, one of the punctuation marks !, , ? or ], or the end of a line.
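One way to combine the pattern with the dot trimming is sketched below. The function name linkify, the anchor markup it emits, and the choice to prefix bare www links with http:// in the href are assumptions made for this example and are not part of the answer.

```javascript
// Pattern from this answer, unchanged.
const urlPattern = /(?:(?:ht|f)tps?:\/\/|www)[^<>\]]+?(?![^<>\]]*([>]|<\/))(?=[\s!,?\]]|$)/gm;

// Hypothetical helper: wraps each matched URL in an anchor and keeps any
// trailing dots outside of the link.
function linkify(text) {
  return text.replace(urlPattern, (match) => {
    // Trim the dots at the end of the match, if any.
    const url = match.replace(/\.+$/, '');
    const trailingDots = match.slice(url.length);
    // Assumption: links that start with www get an http:// scheme in the href.
    const href = /^www/i.test(url) ? 'http://' + url : url;
    return '<a href="' + href + '">' + url + '</a>' + trailingDots;
  });
}

console.log(linkify('See http://www.example.com/test.'));
console.log(linkify('www.example.com/test here'));
console.log(linkify('<a href="http://www.example.com">http://www.example.com </a>'));
```

Because the pattern refuses to match in front of a > or </, running linkify a second time over its own output should leave the generated anchors alone, which is what a routine that reprocesses the whole text needs.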

Jerry answered Sep 17 '25 00:09



The following regex should work. It gives the desired result on your sample inputs.

/((?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm

See it in action here

(?!.*<\/a>) is a negative lookahead that rejects a match when a closing </a> follows later on the same line.

The matched content is stored in $1 and can be used in the replacement string.
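A minimal sketch of using $1 in the replacement string follows. The function name and the anchor markup are assumptions for illustration, and only a single pass over the text is shown.

```javascript
// Pattern from this answer, unchanged. The outer parentheses form group 1.
const capturePattern = /((?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))/gm;

// Hypothetical helper: $1 in the replacement string is substituted with
// whatever group 1 captured for each match.
function wrapWithGroup(text) {
  return text.replace(capturePattern, '<a href="$1">$1</a>');
}

console.log(wrapWithGroup('Visit http://www.example.com/test now'));
console.log(wrapWithGroup('<a href="http://www.example.com">http://www.example.com </a>'));
```

The second call leaves the sample anchor untouched because a closing </a> follows each candidate on the same line.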

EDIT

To avoid matching content with <img src .., the following can be used:

(^(?!.*<img\s+src)(?:(?:ht|f)tps?:\/\/|www)[^\s,?!]+(?!.*<\/a>))

jkshah answered Sep 17 '25 00:09
