Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression to find URLs within a string

Tags:

string

regex

url

People also ask

How do I find the URL of a string?

In Java, this can be done by using Pattern. matcher(). Find the substring from the first index of match result to the last index of the match result and add this substring into the list. After completing the above steps, if the list is found to be empty, then print “-1” as there is no URL present in the string S.

What is regex for URL?

URL regular expressions can be used to verify if a string has a valid URL format as well as to extract an URL from a string.

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string).


This is the one I use

(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])

Works for me, should work for you too.


Guess no regex is perfect for this use. I found a pretty solid one here

/(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])/igm

Some differences / advantages compared to the other ones posted here:

  • It does not match email addresses
  • It does match localhost:12345
  • It won't detect something like moo.com without http or www

See here for examples


text = """The link of this question: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string
Also there are some urls: www.google.com, facebook.com, http://test.com/method?param=wasd, http://test.com/method?param=wasd&params2=kjhdkjshd
The code below catches all urls in text and returns urls in list."""

urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-&?=%.]+', text)
print(urls)

Output:

[
    'https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string', 
    'www.google.com', 
    'facebook.com',
    'http://test.com/method?param=wasd',
    'http://test.com/method?param=wasd&params2=kjhdkjshd'
]

None of the solutions provided here solved the problems/use-cases I had.

What I have provided here, is the best I have found/made so far. I will update it when I find new edge-cases that it doesn't handle.

\b
  #Word cannot begin with special characters
  (?<![@.,%&#-])
  #Protocols are optional, but take them with us if they are present
  (?<protocol>\w{2,10}:\/\/)?
  #Domains have to be of a length of 1 chars or greater
  ((?:\w|\&\#\d{1,5};)[.-]?)+
  #The domain ending has to be between 2 to 15 characters
  (\.([a-z]{2,15})
       #If no domain ending we want a port, only if a protocol is specified
       |(?(protocol)(?:\:\d{1,6})|(?!)))
\b
#Word cannot end with @ (made to catch emails)
(?![@])
#We accept any number of slugs, given we have a char after the slash
(\/)?
#If we have endings like ?=fds include the ending
(?:([\w\d\?\-=#:%@&.;])+(?:\/(?:([\w\d\?\-=#:%@&;.])+))*)?
#The last char cannot be one of these symbols .,?!,- exclude these
(?<![.,?!-])

Wrote one up myself:

let regex = /([\w+]+\:\/\/)?([\w\d-]+\.)*[\w-]+[\.\:]\w+([\/\?\=\&\#\.]?[\w-]+)*\/?/gm

It works on ALL of the following domains:

https://www.facebook.com
https://app-1.number123.com
http://facebook.com
ftp://facebook.com
http://localhost:3000
localhost:3000/
unitedkingdomurl.co.uk
this.is.a.url.com/its/still=going?wow
shop.facebook.org
app.number123.com
app1.number123.com
app-1.numbEr123.com
app.dashes-dash.com
www.facebook.com
facebook.com
fb.com/hello_123
fb.com/hel-lo
fb.com/hello/goodbye
fb.com/hello/goodbye?okay
fb.com/hello/goodbye?okay=alright
Hello www.google.com World http://yahoo.com
https://www.google.com.tr/admin/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
http://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
ftp://google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com.tr/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
www.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
drive.google.com/test/subPage?qs1=sss1&qs2=sss2&qs3=sss3#Services
https://www.example.pl
http://www.example.com
www.example.pl
example.com
http://blog.example.com
http://www.example.com/product
http://www.example.com/products?id=1&page=2
http://www.example.com#up
http://255.255.255.255
255.255.255.255
shop.facebook.org/derf.html

You can see how it performs here on regex101 and adjust as needed