 

Regular expression for parsing links from a webpage?

Tags: html, .net, regex

I'm looking for a .NET regular expression to extract all the URLs from a webpage, but I haven't found one comprehensive enough to cover all the different ways a link can be specified.

And a side question:

Is there one regex to rule them all? Or am I better off using a series of less complicated regular expressions and making multiple passes against the raw HTML? (Speed vs. Maintainability)

Chris Smith asked Aug 08 '08 17:08





1 Answer

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

I took this from regexlib.com

[editor's note: the {1} has no real function in this regex; see this post]
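To see how the pattern above behaves on real markup, here is a minimal sketch in Python. Python is used only for illustration; the same pattern string should also work with .NET's Regex.Matches. The names URL_PATTERN, TIGHTER_PATTERN and extract_urls, and the sample input, are hypothetical and not part of the original answer. The first pattern is the answer's regex with the redundant {1} and the unneeded backslashes dropped. The second is an assumed tweak that stops at quotes and angle brackets, since \S+ also swallows the markup that follows a URL.

```python
import re

# The answer's pattern, minus the redundant {1} and unnecessary escapes.
# Groups are non-capturing so each match comes back as a single string.
URL_PATTERN = re.compile(r"(?:mailto:|(?:news|(?:ht|f)tps?)://)\S+")

# Assumed variant (not from the answer): stop at whitespace, quotes,
# or angle brackets so that trailing HTML is not included in the match.
TIGHTER_PATTERN = re.compile(r"(?:mailto:|(?:news|(?:ht|f)tps?)://)[^\s\"'<>]+")


def extract_urls(text, pattern=URL_PATTERN):
    """Return every substring of `text` matched by `pattern`, in order."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = '<a href="http://example.com/a">A</a> <a href="mailto:bob@example.com">mail</a>'

# The original pattern keeps everything up to the next whitespace,
# including the closing quote and tags.
print(extract_urls(sample))
# The tighter variant returns just the URL text.
print(extract_urls(sample, TIGHTER_PATTERN))
```

Neither pattern finds relative links such as href="/about" or scheme-less hosts, so this only covers absolute URLs with one of the listed schemes. For anything beyond quick scraping, an HTML parser is likely the more maintainable choice.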

csmba answered Sep 19 '22 21:09
