
regex for URL including query string

Tags: c#, .net, regex, url

I thought this would be a simple Google search, but apparently not. What regex can I use in C# to pull a URL, including any query string, out of a larger body of text? I have spent a lot of time searching and found plenty of examples, but none of them include the query string. I can't use System.Uri, because that assumes you already have the URL. I need to find it in the surrounding text.

Asked by JoelFan, Feb 26 '10 at 16:02



1 Answer

This should get just about anything (feel free to add additional protocols):

@"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*"

The real difficulty is finding where the URL ends. As written, this pattern stops at the first character it considers invalid. Within the domain name, that is anything other than letters, numbers, hyphen, or period. After the domain name, it is anything other than those plus forward slash (/), question mark (?), ampersand (&), equals sign (=), semicolon (;), plus sign (+), exclamation point (!), apostrophe/single quote ('), open/close parentheses, asterisk (*), underscore (_), tilde (~), or percent sign (%).
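As a minimal sketch of how the pattern above might be applied to a larger text, the following wraps it in a helper that collects every match. The class name UrlFinder and the method name FindUrls are hypothetical, chosen only for illustration; the pattern itself is unchanged.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for this example.
public static class UrlFinder
{
    // The pattern from the answer, unchanged.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Returns every substring of text that the pattern matches, in order.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

For example, UrlFinder.FindUrls("See http://example.com/search?q=regex&page=2 for details") returns a single entry, "http://example.com/search?q=regex&page=2". The match stops at the space, which is the first character outside the allowed set.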

Note that this would allow invalid URLs like

http://../

And it would pick up stuff after a URL, such as in this string:

Maybe you should try http://www.google.com.

Here, "http://www.google.com." (with the trailing period) would be matched.
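One possible workaround, which is not part of the pattern itself, is to trim likely sentence punctuation from the end of each match afterwards. The sketch below assumes that a trailing period, exclamation point, question mark, semicolon, or apostrophe belongs to the surrounding sentence rather than to the URL. The class name UrlTrimmer and the method name FindTrimmedUrls are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for this example.
public static class UrlTrimmer
{
    // The pattern from the answer, unchanged.
    private static readonly Regex UrlPattern = new Regex(
        @"(https?|ftp|file)\://[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Assumption: these characters, when they end a match, are sentence
    // punctuation and not part of the URL.
    private static readonly char[] TrailingPunctuation = { '.', '!', '?', ';', '\'' };

    // Returns every match with trailing punctuation stripped.
    public static List<string> FindTrimmedUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            urls.Add(match.Value.TrimEnd(TrailingPunctuation));
        }
        return urls;
    }
}
```

With this, UrlTrimmer.FindTrimmedUrls("Maybe you should try http://www.google.com.") yields "http://www.google.com". The trade-off is that a URL that genuinely ends in one of those characters would be cut short.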

It would also miss URLs that don't begin with a protocol specification (specifically, one of the protocols within the first set of parentheses). For instance, it would miss the URL in this string:

Maybe you should try www.google.com.
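If bare "www." addresses matter, one option is to accept "www." as an alternative to the protocol prefix. This is only a sketch of that variation, not the pattern from the answer: the extra alternative is an assumption about what counts as a URL, and the names LooseUrlFinder and FindUrls are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for this example.
public static class LooseUrlFinder
{
    // Variation on the answer's pattern: the prefix may be a protocol
    // followed by "://" or a literal "www." (an added assumption).
    private static readonly Regex UrlPattern = new Regex(
        @"((https?|ftp|file)\://|www\.)[A-Za-z0-9\.\-]+(/[A-Za-z0-9\?\&\=;\+!'\(\)\*\-\._~%]*)*");

    // Returns every substring of text that the variation matches, in order.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

LooseUrlFinder.FindUrls("Maybe you should try www.google.com.") returns "www.google.com.", still including the trailing period, so the end-of-URL problem remains. Hosts that use neither a listed protocol nor a "www." prefix are still missed.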

It's very difficult to get every case without some better-defined boundaries.

Answered by P Daddy, Oct 25 '22 at 12:10