
Regex to match URL

Tags:

regex

I am using the following regex to match a URL:

$search  = "/([\S]+\.(MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|au|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BL|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|EH|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MF|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|R|H|TP|TR|TT|TV|TW|TZ|UA|UG|UK|UM|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|YU|ZA|ZM|ZW)([\S]*))/i";

But it's a bit off, because it also matches "abc.php", which I don't want, and something like abc...test. I do want it to match abc.com, though, as well as www.abc.com and http://abc.com.

It just needs a slight tweak at the end, but I am not sure what. (There should be a slash after any domain name, which it is not checking for right now; it is only checking \S.)

Thank you for your time.

Alec Smart asked Jul 17 '09 07:07



2 Answers

$search  = "#^((?#
    the scheme:
  )(?:https?://)(?#
    second level domains and beyond:
  )(?:[\S]+\.)+((?#
    top level domains:
  )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
  )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
  )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
  )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
  )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
  )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
  )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
  )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
  )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
  )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
  )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";

I just cleaned it up a bit. This will match only HTTP(S) addresses with the http:// or https:// scheme declared, and, as long as you copied all top-level domains correctly from IANA, only those with a standardized TLD (it will not match http://localhost).

Finally, the pattern should end with the path part, which always starts with a / if it is present.
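To see why ending on an optional path that starts with / matters, here is a minimal PHP sketch. It assumes a shortened, purely illustrative TLD list ($tlds) in place of the full one, and the names $loose and $strict exist only for this example.

```php
<?php
// Shortened TLD list for illustration only (an assumption, not the full IANA list).
$tlds = 'COM|NET|ORG|PH|UK';

// Question's approach: unanchored, and any non-space run may follow the TLD.
$loose = "/([\S]+\.($tlds)([\S]*))/i";

// Answer's approach: anchored, scheme required, and only an optional
// path that starts with "/" may follow the TLD.
$strict = "#^(?:https?://)(?:[\S]+\.)+($tlds)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?$#i";

var_dump(preg_match($loose, 'abc.php'));               // int(1): PH matches, then [\S]* takes the last "p"
var_dump(preg_match($strict, 'http://abc.php'));       // int(0): the "p" after PH is not a path
var_dump(preg_match($strict, 'http://abc.com/a.php')); // int(1): a.php is part of the path
```

The difference comes from the end of the pattern: once the only thing allowed after the TLD is a /-prefixed path followed by the end of the string, a TLD that is merely a prefix of a file extension no longer counts.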

However, I'd suggest following Cerebrus's advice: if you're not sure about this, learn regexps in a more gentle way and use proven patterns for complicated tasks.
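As one example of leaning on something proven rather than a hand-written pattern, PHP ships a built-in URL validator. This is a different technique from the regex above, not something this answer itself proposes, and it has its own rules: it requires a scheme and does not check the TLD against any list. A minimal sketch:

```php
<?php
// filter_var() returns the URL string when it validates, or false otherwise.
$withPath  = filter_var('http://abc.com/path?x=1', FILTER_VALIDATE_URL) !== false;
$noScheme  = filter_var('abc.com', FILTER_VALIDATE_URL) !== false;
$localhost = filter_var('http://localhost', FILTER_VALIDATE_URL) !== false;

var_dump($withPath);  // bool(true)
var_dump($noScheme);  // bool(false): no scheme
var_dump($localhost); // bool(true): no TLD check is performed

// parse_url() splits a URL into components; here only the host is requested.
$host = parse_url('http://www.abc.com/index.php', PHP_URL_HOST);
var_dump($host);      // string(11) "www.abc.com"
```

If scheme-less input such as abc.com has to be accepted, this validator alone will not do, so the regex route may still be needed for that case.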

Cheers,

By the way: your regexp will also match something.r and something.h (the stray R and H between |TO| and |TP| in your example). I left them out of my version, as I guess they were a typo.
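The effect of those two stray alternatives can be checked in isolation. The sketch below reproduces only the neighbourhood of the typo, so the two short lists are illustrative excerpts and not the question's full pattern; the variable names are made up for the example.

```php
<?php
// Excerpt around the typo (illustrative; not the full list from the question).
$withTypo    = "/([\S]+\.(TO|R|H|TP|TR)([\S]*))/i";
$withoutTypo = "/([\S]+\.(TO|TP|TR)([\S]*))/i";

foreach (['something.r', 'something.h'] as $candidate) {
    printf("%-12s with typo: %d, without: %d\n",
        $candidate,
        preg_match($withTypo, $candidate),
        preg_match($withoutTypo, $candidate));
}
```

With the single-letter alternatives present, both candidates match; with them removed, neither does.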

On re-reading the question: Change

  )(?:https?://)(?# 

to

  )(?:https?://)?(?# 

(note the extra ?) to match 'URLs' without the scheme.
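Here is a small sketch of the optional-scheme variant against the inputs from the question. It again assumes a shortened, illustrative TLD list. The second pattern, $strictLabels, is a hypothetical variation that is not part of this answer: it swaps [\S]+ for [a-z0-9-]+ so that a run of dots cannot be absorbed into a label.

```php
<?php
// Shortened TLD list for illustration only (an assumption, not the full IANA list).
$tlds = 'COM|NET|ORG|PH|UK';

// Scheme made optional with the extra "?".
$optionalScheme = "#^(?:https?://)?(?:[\S]+\.)+($tlds)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?$#i";

// Hypothetical variation: each label may contain only letters, digits and hyphens.
$strictLabels = "#^(?:https?://)?(?:[a-z0-9-]+\.)+($tlds)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?$#i";

foreach (['abc.com', 'www.abc.com', 'http://abc.com', 'abc.php', 'abc...com'] as $candidate) {
    printf("%-16s optionalScheme=%d strictLabels=%d\n",
        $candidate,
        preg_match($optionalScheme, $candidate),
        preg_match($strictLabels, $candidate));
}
```

Both patterns accept abc.com, www.abc.com and http://abc.com and reject abc.php. They differ on abc...com: because \S also matches a dot, the first pattern accepts it, while the stricter label class rejects it.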

Boldewyn answered Sep 25 '22 04:09


Not exactly what the OP asked for, but this is a much simpler regular expression that does not need to be updated each time IANA introduces a new TLD. I believe it is more suitable for most simple needs:

^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$

There is no list of TLDs, localhost is not matched, there must be at least 2 subparts, and every subpart after the first must be at least 2 characters long (e.g. "a.a" will not match but "a.ab" will).
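A quick way to see what this pattern does and does not accept is to run it over a few candidates. The sketch below wraps the pattern in # delimiters (needed in PHP because the pattern contains /); the candidate list is made up for illustration.

```php
<?php
// The pattern from this answer, wrapped in "#" delimiters because it contains "/".
$pattern = '#^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$#';

$candidates = ['a.ab', 'www.abc.com', 'http://abc.com', 'abc.php',
               'a.a', 'localhost', 'my-site.com', 'abc.com/path'];

foreach ($candidates as $candidate) {
    printf("%-16s %s\n", $candidate, preg_match($pattern, $candidate) ? 'match' : 'no match');
}
```

The first four candidates match and the last four do not. Note that abc.php matches, since there is no TLD list to rule it out, and that hyphenated hosts and paths are rejected because \w covers neither - nor /.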

Diego Perini answered Sep 22 '22 04:09