What's the cleanest way to extract URLs from a string using Python?

Although I know I could use some huge-ass regex such as the one posted here, I'm wondering if there is some tweaky-as-hell way to do this, either with a standard module or perhaps some third-party add-on?

Simple question, but nothing jumped out on Google (or Stack Overflow).

Look forward to seeing how y'all do this!

722

asked Feb 06 '09 11:02

jkp

1 Answer

I know that it's exactly what you do not want, but here's a file with a huge regex:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    """
    the web url matching regex used by markdown
    http://daringfireball.net/2010/07/improved_regex_for_matching_urls
    https://gist.github.com/gruber/8891611
    """
    URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

I call that file urlmarker.py, and when I need it I just import it, e.g.:

    import urlmarker
    import re

    re.findall(urlmarker.URL_REGEX, 'some text news.yahoo.com more text')

cf. http://daringfireball.net/2010/07/improved_regex_for_matching_urls
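If you only care about links that carry an explicit scheme, a much smaller pattern may be enough for quick experiments. The sketch below is not Gruber's regex: `SIMPLE_URL_REGEX`, `TRAILING_PUNCTUATION` and `extract_urls` are names made up for this illustration, and the pattern only finds URLs starting with http:// or https://, so a bare `news.yahoo.com` is missed.

```python
import re

# Hypothetical, deliberately simplified pattern (NOT the Gruber regex above):
# it only matches URLs that start with an explicit http:// or https:// scheme.
SIMPLE_URL_REGEX = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Punctuation that usually belongs to the surrounding sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:!?)"


def extract_urls(text):
    """Return scheme-prefixed URLs found in text, in order of appearance."""
    return [
        match.rstrip(TRAILING_PUNCTUATION)
        for match in SIMPLE_URL_REGEX.findall(text)
    ]


if __name__ == "__main__":
    print(extract_urls("See https://example.com/a, then http://example.org/b?x=1."))
```

Stripping a trailing `)` is a trade-off: it cleans up links written inside parentheses but also cuts URLs that genuinely end in one, which is one of the cases the big regex handles better.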

Also, here is what Django (1.6) uses to validate URLFields:

    import re

    regex = re.compile(
        r'^(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
        r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

cf. https://github.com/django/django/blob/1.6/django/core/validators.py#L43-50

Django 1.9 has that logic split across a few classes.
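One caveat: the Django pattern is anchored with `^` and `$`, so it validates a whole string rather than finding URLs inside longer text. To extract with a validator like that, you can split the text on whitespace and test each token. Here is a rough sketch of that idea using `urllib.parse` from the standard library instead of Django's regex. The names `ALLOWED_SCHEMES`, `looks_like_url` and `extract_valid_urls` are hypothetical, and the check is intentionally loose (it only requires a known scheme and a non-empty host).

```python
from urllib.parse import urlparse

# Assumption: mirror the schemes accepted by the Django 1.6 pattern above.
ALLOWED_SCHEMES = {"http", "https", "ftp", "ftps"}

# Characters commonly wrapped around or trailing a URL in running text.
SURROUNDING_PUNCTUATION = "<>()[]\"'.,;:!?"


def looks_like_url(candidate):
    """Loose check: a known scheme plus a non-empty network location."""
    try:
        parts = urlparse(candidate)
    except ValueError:
        # urlparse raises ValueError on malformed input such as a bad IPv6 host.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def extract_valid_urls(text):
    """Split on whitespace, trim punctuation, keep tokens that look like URLs."""
    tokens = (token.strip(SURROUNDING_PUNCTUATION) for token in text.split())
    return [token for token in tokens if looks_like_url(token)]


if __name__ == "__main__":
    sample = "Docs at <https://docs.python.org/3/>, mirror ftp://example.org/pub and news.yahoo.com."
    print(extract_valid_urls(sample))
```

Like the Django validator, this ignores scheme-less hosts such as `news.yahoo.com`; if you need those, the big regex above is the better fit.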

127

answered Sep 28 '22 03:09

dranxo

Tags: python, regex, url
