>>> import re
>>> url = 'Hello World<a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
>>> urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', url)
>>> print(urls)
['http://example.com', 'http://example2.com']

Regex to extract URLs from href attribute in HTML with Python [duplicate]

Tags: python, regex, url

Don't use a regex

The expression in the accepted answer misses many cases. Among other things, URLs can have Unicode characters in them. The regex you want is here, and after looking at it, you may conclude that you don't really want it after all. The most correct version is ten thousand characters long.
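
To make the "misses many cases" point concrete, here is a small check that runs the `https?://...` pattern from the snippet at the top of the page against some sample markup. The sample HTML is made up for illustration, and it shows only one failure mode, not an exhaustive list: the path and query string are cut off, and a relative link is not found at all.

```python
import re

# The pattern from the snippet at the top of the page, as a raw string.
PATTERN = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

# Hypothetical sample input: one absolute URL with a path and query string,
# and one relative link.
html = '<a href="http://example.com/docs/page?id=7">Docs</a><a href="/about">About</a>'

# The character class has no '/', '?' or '=', so matching stops at the host.
found = re.findall(PATTERN, html)
print(found)  # ['http://example.com']
```

Neither of the two href values comes back intact, which is the kind of gap that pushes a URL regex toward the very long versions.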

Admittedly, if you were starting with plain, unstructured text with a bunch of URLs in it, then you might need that ten-thousand-character-long regex. But if your input is structured, use the structure. Your stated aim is to "extract the url, inside the anchor tag's href." Why use a ten-thousand-character-long regex when you can do something much simpler?

Parse the HTML instead

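For many tasks, Beautiful Soup will be far quicker and easier to use:

>>> from bs4 import BeautifulSoup as Soup
>>> s = 'Hello World<a href="http://example.com">More Examples</a><a href="http://example2.com">Even More Examples</a>'
>>> html = Soup(s, 'html.parser')           # Soup(s, 'lxml') if lxml is installed
>>> [a['href'] for a in html.find_all('a', href=True)]   # href=True skips anchors with no href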
['http://example.com', 'http://example2.com']

If you prefer not to use external tools, you can also directly use Python's own built-in HTML parsing library. Here's a really simple subclass of HTMLParser that does exactly what you want:

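from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, output_list=None):
        super().__init__()
        # Append to the caller's list if one is given, otherwise start a new one.
        self.output_list = [] if output_list is None else output_list

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:    # skip anchors that have no href
                self.output_list.append(href)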

Test:

>>> p = MyParser()
>>> p.feed(s)
>>> p.output_list
['http://example.com', 'http://example2.com']

You could even create a new method that accepts a string, calls feed, and returns output_list. This approach is vastly more powerful and extensible than regular expressions for extracting information from HTML.
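
As a rough sketch of that convenience method: the class below adds a helper that feeds a string to the parser and returns the collected hrefs. The names `LinkParser` and `extract_links` are assumptions made up for this example and are not part of `HTMLParser`; the parser logic is repeated so the block runs on its own.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (illustrative, hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.output_list = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.output_list.append(href)

    def extract_links(self, text):
        """Hypothetical helper: feed text, then return the hrefs seen so far."""
        self.feed(text)
        return self.output_list


s = ('Hello World<a href="http://example.com">More Examples</a>'
     '<a href="http://example2.com">Even More Examples</a>')
links = LinkParser().extract_links(s)
print(links)  # ['http://example.com', 'http://example2.com']
```

Because the list lives on the instance, calling `extract_links` again on the same parser keeps appending to it; create a fresh parser per document if you want separate results.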
