
C# regex pattern to extract URLs from a given string - not only full HTML URLs but bare links as well

I need a regex that does the following:

Extract all strings that start with http://

Extract all strings that start with www.

So I need to extract both kinds.

For example, given this string:

    house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue

From the string above I want to get:

    www.monstermmorpg.com
    http://www.monstermmorpg.com
    http://www.monstermmorpg.commerged

Looking for regex or another way. Thank you.

C# 4.0

asked May 14 '12 01:05 by MonsterMMORPG



1 Answer

You can write a fairly simple regular expression to handle this, or use a more traditional string-splitting + LINQ approach.

Regex

    using System.Text.RegularExpressions;

    // Matches anything that starts with http://, https:// or www. and runs up to the next whitespace.
    var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";

    // Show each extracted link.
    foreach (Match m in linkParser.Matches(rawString))
        MessageBox.Show(m.Value);

Pattern explanation:

    \b         - matches a word boundary (the position between a word character and a non-word character such as a space or a period)
    (?:        - begins a group; the ?: means the group's contents are not captured
    https?://  - matches http:// or https:// (the '?' after the "s" makes it optional)
    |          - OR
    www\.      - matches the literal string www. (the \. means a literal ".")
    )          - ends the group
    \S+        - matches a series of non-whitespace characters
    \b         - matches the closing word boundary

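Basically, the pattern looks for strings that start with http://, https://, or www. (the (?:https?://|www\.) group) and then matches all the characters up to the next whitespace. Because of the closing \b, the match ends at the last word character, so trailing punctuation such as a final period is left out.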
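As a rough sketch of how the pattern behaves on the question's sample string, the snippet below wraps the same regex in a small console program. The class name LinkExtractor and the helper ExtractLinks are made up for illustration, and Console.WriteLine stands in for MessageBox.Show so it runs without Windows Forms.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as in the answer above.
    private static readonly Regex LinkParser = new Regex(
        @"\b(?:https?://|www\.)\S+\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every match, in order of appearance.
    public static List<string> ExtractLinks(string text)
    {
        var links = new List<string>();
        foreach (Match m in LinkParser.Matches(text))
            links.Add(m.Value);
        return links;
    }

    public static void Main()
    {
        var sample = "house home go www.monstermmorpg.com nice hospital " +
                     "http://www.monstermmorpg.com this is incorrect url " +
                     "http://www.monstermmorpg.commerged continue";

        // Prints the three links from the question, one per line.
        foreach (var link in ExtractLinks(sample))
            Console.WriteLine(link);
    }
}
```

Matches do not overlap, so the www. part inside an http:// link is not reported a second time.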

Traditional String Options

    using System.Linq;

    var rawString = "house home go www.monstermmorpg.com nice hospital http://www.monstermmorpg.com this is incorrect url http://www.monstermmorpg.commerged continue";

    // Split on whitespace and keep only the tokens that start like a link.
    var links = rawString.Split("\t\n ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
        .Where(s => s.StartsWith("http://") || s.StartsWith("www.") || s.StartsWith("https://"));

    foreach (string s in links)
        MessageBox.Show(s);

answered Sep 28 '22 09:09 by Jason Larke
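The splitting approach differs from the regex in two small ways that a sketch can make visible: StartsWith without a comparison argument is case-sensitive, and a token keeps any punctuation attached to it. The variant below is only an illustration. The names SplitLinkExtractor and ExtractLinks are hypothetical, and the use of StringComparison.OrdinalIgnoreCase and the extra '\r' separator are assumptions added here, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitLinkExtractor
{
    // Prefixes taken from the answer above.
    private static readonly string[] Prefixes = { "http://", "https://", "www." };

    // Hypothetical helper: split on whitespace, keep tokens with a known prefix.
    public static List<string> ExtractLinks(string text)
    {
        return text
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            // Assumption: compare case-insensitively, to mirror RegexOptions.IgnoreCase.
            .Where(token => Prefixes.Any(p => token.StartsWith(p, StringComparison.OrdinalIgnoreCase)))
            .ToList();
    }

    public static void Main()
    {
        // An upper-case scheme and a trailing period, to show both behaviours.
        foreach (var link in ExtractLinks("Go to HTTP://example.com or www.example.org."))
            Console.WriteLine(link);
    }
}
```

With this input the second token comes back as www.example.org. with its final period, whereas the regex version would stop at the last word character.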