I'm trying to write a regular expression which will find URLs in a plain-text string, so that I can wrap them with anchor tags. I know there are expressions already available for this, but I want to create my own, mostly because I want to know how it works.
Since it's not going to break anything if my regex fails, my plan is to write something fairly simple. So far that means: 1) match "www" or "http" at the start of a word 2) keep matching until the word ends.
I can do that, AFAICT. I have this: \b(http|www).?[^\s]+
Which works on foo www.example.com bar http://www.example.com
etc.
The problem is that if I give it foo www.example.com, http://www.example.com
it thinks that the comma is a part of the URL.
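To make the symptom concrete, here is a minimal sketch that runs the expression above against both sample strings. The variable names ($pattern, $ok, $bad) are only for illustration.

```php
<?php
// The simple pattern from the question: "www" or "http" at a word
// boundary, then everything up to the next whitespace character.
$pattern = '/\b(http|www).?[^\s]+/';

// Case that works: each URL is followed by a space or the end of the string.
preg_match_all($pattern, 'foo www.example.com bar http://www.example.com', $ok);

// Problem case: the first URL is followed by a comma.
preg_match_all($pattern, 'foo www.example.com, http://www.example.com', $bad);

print_r($ok[0]);   // www.example.com and http://www.example.com
print_r($bad[0]);  // the first match is "www.example.com," with the comma attached
```

Because [^\s]+ only stops at whitespace, the comma is consumed like any other non-space character.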
So, if I am to use one expression to do this, I need to change "...and stop when you see whitespace" to "...and stop when you see whitespace or a piece of punctuation right before whitespace". This is what I'm not sure how to do.
At the moment, the solution I'm thinking of running with is just adding another step – matching the URL, and then on the next line moving any sneaky trailing punctuation out of the match. This just isn't as elegant.
Note: I am writing this in PHP.
Aside: why does replacing \s with \b in the expression above not seem to work?
ETA:
Thanks everyone!
This is what I eventually ended up with, based on Explosion Pills's advice:
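function add_links( $string ) {
    // Anonymous function instead of a nested named one, which PHP would
    // try to redeclare (fatal error) the second time add_links() runs.
    $replace = function ( $arr ) {
        // $arr[1] = URL, $arr[2] = trailing punctuation (may be empty),
        // $arr[3] = the whitespace or end of string that ended the match.
        if ( strncmp( "http", $arr[1], 4 ) == 0 ) {
            return "<a href=\"$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        } else {
            // Bare "www." matches get a scheme so the href is absolute.
            return "<a href=\"http://$arr[1]\">$arr[1]</a>$arr[2]$arr[3]";
        }
    };
    return preg_replace_callback( '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/x', $replace, $string );
}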
I added a callback so that all of the links would start with http://, and did some fiddling with the way it handles punctuation.
It's probably not the Best way to do things, but it works. I've learned a lot about this in the last little while, but there is still more to learn!
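As far as I can tell, the "fiddling" with punctuation comes down to the (?!\/) lookahead in front of [\p{P}]+: it stops a slash from starting the trailing-punctuation group, so a URL ending in "/" keeps its slash. The sketch below compares the pattern with and without that lookahead. The sample input and variable names are made up for illustration, and the u modifier is my addition so that \p{P} is evaluated in UTF-8 mode.

```php
<?php
// Pattern as in add_links(), plus the u modifier (an assumption, see above).
$with    = '/\b((?:http|www).+?)((?!\/)[\p{P}]+)?(\s|$)/u';
// Same pattern without the negative lookahead, for comparison.
$without = '/\b((?:http|www).+?)([\p{P}]+)?(\s|$)/u';

$input = 'see http://example.com/, thanks';

preg_match($with, $input, $a);
preg_match($without, $input, $b);

echo $a[1], "\n"; // http://example.com/  (slash stays in the URL, "," is group 2)
echo $b[1], "\n"; // http://example.com   (group 2 swallowed "/,")
```

Note that the lookahead only guards the first punctuation character of the group, so a slash that comes after other punctuation can still end up in group 2.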
preg_replace('/
    \b                # Initial word boundary
    (                 # Start capture
        (?:           # Non-capture group
            http|www  # http or www (alternation)
        )             # End group
        .+?           # Reluctant match for at least one character until...
    )                 # End capture
    (                 # Start capture
        [,.]+         # ...one or more of either a comma or period.
                      # Add more punctuation as needed
    )?                # End optional capture
    (\s|$)            # Followed by either a space character or end of string
/x', '<a href="\1">\1</a>\2\3', $string); // $string: the input text, as named in the question
...is probably what you are going for. I think it's still imperfect, but it should at least work for your needs.
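As a quick check, here is a sketch that applies a one-line form of that pattern to the problem string from the question. The $pattern, $input and $output names are only for illustration.

```php
<?php
// One-line equivalent of the commented pattern above.
$pattern = '/\b((?:http|www).+?)([,.]+)?(\s|$)/';
$input   = 'foo www.example.com, http://www.example.com';

// \1 is the URL, \2 any trailing commas or periods, \3 the whitespace
// (or empty end-of-string match) that terminated the URL.
$output = preg_replace($pattern, '<a href="\1">\1</a>\2\3', $input);

echo $output, "\n";
// foo <a href="www.example.com">www.example.com</a>, <a href="http://www.example.com">http://www.example.com</a>
```

The comma now lands outside the anchor. The bare www link still gets a relative href, which is what the callback version in the question's update addresses by prepending http://.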
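Aside: I think this is because \b is a zero-width assertion that matches at any boundary between a word character and a non-word character, punctuation included, so it doesn't consume anything the way \s does. Also note that inside a character class, as in [^\b], \b means the backspace character rather than a word boundary.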
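A small sketch of what happens when \s is swapped for \b inside the question's character class. The variable names are illustrative.

```php
<?php
$input = 'foo www.example.com, http://www.example.com';

// Original: [^\s]+ stops at the first whitespace character.
preg_match('/\b(http|www).?[^\s]+/', $input, $withS);

// Swapped: inside [...] the \b escape means backspace (0x08), so [^\b]+
// matches every character that is not a backspace, spaces included.
preg_match('/\b(http|www).?[^\b]+/', $input, $withB);

echo $withS[0], "\n"; // www.example.com,
echo $withB[0], "\n"; // www.example.com, http://www.example.com
```

So the swapped version runs to the end of the string instead of stopping at the end of the word.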