
Need a good regex to convert URLs to links but leave existing links alone

Tags: html, regex, url, php

I have a load of user-submitted content. It is HTML and may contain URLs. Some of them will already be wrapped in <a> tags (if the user is good), but sometimes users are lazy and just type www.something.com or, at best, http://www.something.com.

I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?

Nick Locking asked Nov 13 '08

5 Answers

Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog post that addresses the issues Jeff ran into and provides a nice solution.

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get

(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
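As a rough PHP sketch (the brace delimiters, the /i flag and the replacement markup here are illustrative assumptions, not prescribed by the answer above):

// Sketch only: wrap bare URLs in <a> tags, skipping ones preceded by " or >.
// Braces are used as delimiters because the pattern itself contains /, @, # and ~.
// Note that matches starting with "www." or "ftp." end up as relative hrefs here.
$pattern = '{(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]}i';
$html    = preg_replace($pattern, '<a href="\0">\0</a>', $html);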

Tim Pietzcker answered Sep 30 '22


This thread is old as the hills, but I came across it while working on my own problem: converting any URLs into links while leaving alone any that are already inside anchor tags. After a while, this is what popped out:

(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]

With the following input:

http://www.google.com
http://google.com
www.google.com

<p>http://www.google.com<p>

this is a normal sentence. let's hope it's ok.

<a href="http://www.google.com">www.google.com</a>

This is the output of a preg_replace:

<a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
<a href="http://google.com" rel="nofollow">http://google.com</a>
<a href="www.google.com" rel="nofollow">www.google.com</a>

<p><a href="http://www.google.com" rel="nofollow">http://www.google.com</a><p>

this is a normal sentence. let's hope it's ok.

<a href="http://www.google.com">www.google.com</a>

Just wanted to contribute back to save somebody some time.

Matt answered Sep 30 '22


I made a slight modification to the regex from the original answer:

(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]

which allows for more subdomains, and also runs a fuller check on tags. To apply this with PHP's preg_replace, you can use:

$convertedText = preg_replace( '@(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]@i', '<a href="\0" target="_blank">\0</a>', $originalText );

Note: I removed @ from the regex so it can be used as the delimiter for preg_replace. It's pretty rare that @ would appear in a URL anyway.

Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
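As a quick illustration (the sample text below is made up), the snippet could be exercised like this:

// Illustrative check of the call above; the sample input is invented.
// The URL inside the existing <a> tag should be skipped by the lookbehind.
$originalText  = 'Visit http://example.com or see <a href="http://example.com">this link</a>.';
$convertedText = preg_replace(
    '@(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]@i',
    '<a href="\0" target="_blank">\0</a>',
    $originalText
);
echo $convertedText;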

Hope that helps.

Hodge answered Sep 30 '22


To skip existing ones, just use a lookbehind: add (?<!href=") to the beginning of your regular expression, so it would look something like this:

/(?<!href=")http://\S*/

Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
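In PHP that could be wired up roughly like so (the ~ delimiters and the replacement markup are just illustrative):

// Minimal sketch of the lookbehind approach from this answer.
$html = preg_replace('~(?<!href=")http://\S*~', '<a href="\0">\0</a>', $html);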

Nicole answered Sep 30 '22


if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
    # Successful match
} else {
    # Match attempt failed
}
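If you want the actual replacement rather than just a match test, the same pattern can presumably be passed to preg_replace (the replacement markup below is illustrative; note the trailing lookaheads inspect the rest of the string, so behaviour on longer documents may vary):

// Sketch: reuse the pattern above for replacement instead of detection.
$subject = preg_replace(
    '/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i',
    '<a href="\0">\0</a>',
    $subject
);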
RUX answered Sep 30 '22