How do I rewrite this new pattern for recognising web addresses so that it works in Python?
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
To find the URLs in a given string, we can use the findall() function from Python's regular expression module, re. It returns all non-overlapping matches of the pattern in the string as a list of strings. The string is scanned left to right, and matches are returned in the order found. Note that if the pattern contains capturing groups, findall() returns the groups rather than the whole match.
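As a minimal sketch of findall() in action, the snippet below uses a deliberately simplified pattern (an assumption for illustration, not the pattern from the question) and a made-up sample string:

```python
import re

# Sample text and pattern are illustrative assumptions, not from the question.
text = "See http://example.com and https://example.org/page for details."

# No capturing groups here, so findall() returns the full matches as strings.
urls = re.findall(r"https?://\S+", text)
print(urls)  # ['http://example.com', 'https://example.org/page']
```

Because this pattern has no capturing groups, the list holds whole matches. A pattern with groups would return the group contents instead.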
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern.
The regexes would probably be faster. A good regex engine (and Python has a good one) is a very fast way to do the sorts of string transformations it can handle. Unless you're really good with regexes though, it will be a bit harder to understand.
The original source for that pattern states "This pattern should work in most modern regex implementations", and specifically mentions Perl. Python's regex implementation is modern and similar to Perl's, but it is missing the [:punct:] character class. You can easily build that using this:
>>> import string, re
>>> pat = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'
>>> pat = pat % re.sub(r'([-\\\]])', r'\\\1', string.punctuation)
The re.sub() call escapes the characters that would otherwise be special inside the character set (-, \ and ]).
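Edit: Using re.escape() works just as well, since it just sticks a backslash in front of every character that could be special (before Python 3.7 it escaped every non-alphanumeric character). That felt crude to me at first, but it certainly works fine for this case. Use it in place of the re.sub() line above: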
>>> pat = pat % re.escape(string.punctuation)
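Putting the pieces together, here is a hedged sketch of how the finished pattern might be used. The names `template`, `URL_RE`, `find_urls` and the sample text are hypothetical, and `finditer()` is used instead of `findall()` because the pattern contains capturing groups:

```python
import re
import string

# Pattern from the question, with [:punct:] swapped for a %s placeholder.
template = r'\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^%s\s]|/)))'

# Fill the placeholder with the escaped ASCII punctuation characters.
URL_RE = re.compile(template % re.escape(string.punctuation))

def find_urls(text):
    """Return the full URL matches found in text (hypothetical helper)."""
    # group(1) is the outermost group, i.e. the whole URL. findall() would
    # instead return one tuple of all three groups per match.
    return [m.group(1) for m in URL_RE.finditer(text)]

sample = "Docs at http://example.com/path, mirror at www.example.org."
print(find_urls(sample))  # ['http://example.com/path', 'www.example.org']
```

The trailing comma and full stop are left out of the matches, which is exactly what the negated punctuation class at the end of the pattern is for.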
I don't think Python has this expression:
[:punct:]
Wikipedia says [:punct:] covers the ASCII punctuation characters, which in Python's regex syntax is the same as
[-!"#$%&'()*+,./:;<=>?@\[\\\]^_`{|}~]
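One way to sanity-check an explicit class like this is to compare it with string.punctuation, which Python documents as the ASCII punctuation characters in the C locale. In this sketch, `PUNCT_CLASS` is a hypothetical name, and the class is written with `[`, `\` and `]` backslash-escaped, as Python's re module needs:

```python
import re
import string

# Explicit punctuation class, escaped for Python's re module (raw string).
PUNCT_CLASS = r"""[-!"#$%&'()*+,./:;<=>?@\[\\\]^_`{|}~]"""
punct_re = re.compile(PUNCT_CLASS)

# Collect every ASCII character that the class matches.
matched = {chr(c) for c in range(128) if punct_re.fullmatch(chr(c))}

# The class should cover exactly the characters in string.punctuation.
print(matched == set(string.punctuation))  # True
```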