
Regular Expression to match both relative and absolute URLs

Tags:

regex

Anyone want to try their hand at coming up with a regex that matches all of these:

  • /foo/bar/baz.gif
  • /foo/bar/
  • http://www.foo.com/foo/bar

I think it might be impossible to do it with one regex, but you never know.

EDIT: To clarify, what I'm trying to do is pick out all URIs from a document (not an HTML document).

FlySwat asked Jun 15 '09 22:06


1 Answer

(
  ((http|https|ftp)://([\w-]+\.)+[\w-]+)?  # Capture domain names or IP addresses
  (/[\w~,;\-\./?%&+#=]*)                   # Capture paths, including relative
)

Rationale for this answer:

  1. The whole thing is grouped so you can pick out the entire URL
  2. The protocol portion is optional, but if provided, a hostname or IP address should also be provided (both of which have fewer allowed characters than the rest of the URI).
  3. The "/" at the beginning of the path can also be made optional (write "/?"). Paths can take the form "images/1.gif", which is relative to the current path rather than to the hostname.
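
To try the pattern against the three examples from the question, here is a minimal Python sketch. It makes a few assumptions that are not in the answer itself: the names `URL_PATTERN`, `find_urls` and `sample` are made up for illustration, `[\w-\d]` is written as `[\w-]` (Python rejects the former, and `\w` already covers digits), and the inner groups are non-capturing so that group 1 is the whole URL.

```python
import re

# Free-spacing (re.VERBOSE) version of the pattern from the answer.
# Assumption: [\w-\d] is written as [\w-]; inner groups are non-capturing.
URL_PATTERN = re.compile(
    r"""
    (
      (?:(?:https?|ftp)://(?:[\w-]+\.)+[\w-]+)?   # optional scheme plus host name or IP address
      (?:/[\w~,;\-./?%&+#=]*)                     # path, starting with "/"
    )
    """,
    re.VERBOSE,
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(1) for m in URL_PATTERN.finditer(text)]


sample = "See /foo/bar/baz.gif and /foo/bar/ or http://www.foo.com/foo/bar for details"
print(find_urls(sample))
# ['/foo/bar/baz.gif', '/foo/bar/', 'http://www.foo.com/foo/bar']
```

In this version the path must start with "/", so something like "images/1.gif" is only matched from its first "/" onward.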

Caveats:

  1. mailto and file URIs are not supported.
  2. URLs followed by a period (such as at the end of a sentence, without quotation marks) will include the trailing period.
  3. If the leading "/" is optional (see #3 above), the pattern is going to capture all sorts of things, since almost any run of word characters then matches. If you can verify that no paths are relative to the current path, keep the "/" required.
  4. If all URIs are within HTML attributes (A, LINK, IMG, etc.), you can target the URIs much more accurately by only capturing within quotes, or at least only within HTML tags.
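
One possible way to deal with the trailing-period caveat is to trim sentence punctuation from each match after the fact. The sketch below is only an illustration under stated assumptions: `find_urls_trimmed` and `TRAILING_PUNCTUATION` are hypothetical names, the set of characters to trim is an arbitrary choice, and `[\w-\d]` is again written as `[\w-]` so that Python accepts the pattern.

```python
import re

# Assumption: same character classes as the answer's pattern, with [\w-\d] written as [\w-].
URL_PATTERN = re.compile(
    r"((?:(?:https?|ftp)://(?:[\w-]+\.)+[\w-]+)?/[\w~,;\-./?%&+#=]*)"
)

# Hypothetical choice: characters that are more likely sentence punctuation
# than part of the URL when they appear at the very end of a match.
TRAILING_PUNCTUATION = ".,;?"


def find_urls_trimmed(text):
    """Find matches, then drop sentence punctuation from the end of each one."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(1).rstrip(TRAILING_PUNCTUATION)
        if url:  # skip matches that consisted only of punctuation
            urls.append(url)
    return urls


text = "The image lives at http://www.foo.com/foo/bar. Thumbnails are under /foo/bar/baz.gif, I think."
print(find_urls_trimmed(text))
# ['http://www.foo.com/foo/bar', '/foo/bar/baz.gif']
```

The trade-off is that a URL that legitimately ends in one of those characters loses it, so whether this is acceptable depends on the documents being scanned.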

Edit: whoops, fixed closing paren problem.

richardtallent answered Sep 20 '22 06:09