
Regular expression to extract URL from an HTML link

Tags: python, regex


If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce an external dependency. Plus, it doesn't help with your stated goal of learning regexps, which I'd assume this specific HTML-parsing project is just a part of.

It's pretty easy to do:

from bs4 import BeautifulSoup  # the current package; the original answer used the old BeautifulSoup 3 import

soup = BeautifulSoup(html_to_parse, 'html.parser')
for tag in soup.find_all('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.
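(These days the package is installed with pip install beautifulsoup4 and imported from bs4, as shown above.)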


Don't use regexes; use BeautifulSoup. That, or be crufty enough to shell out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
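If you do go the lynx route, a rough sketch of the shelling-out approach might look like this (assuming lynx is installed and on your PATH; -dump -listonly asks it to print the page's link list rather than the rendered text):

import subprocess

# Render the page with lynx and capture just its list of links.
result = subprocess.run(
    ['lynx', '-dump', '-listonly', 'http://www.ptop.se'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)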


This should work, although there might be more elegant ways.

import re

url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
r = re.compile(r'(?<=href=").*?(?=")')  # lookbehind/lookahead keep only what's between the quotes
print(r.findall(url))  # ['http://www.ptop.se']

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.


Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
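If you want to stay in the standard library, a minimal sketch using html.parser (the modern home of the HTMLParser mentioned above) might look something like this:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>')
print(parser.links)  # ['http://www.ptop.se']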


There are tonnes of them on regexlib.


Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do this. Use SGMLParser or BeautifulSoup, or write a parser - but don't use RE's. The ones that seem to work are extremely complicated and still don't cover all cases.