
Remove all forms of URLs from a given string in Python

Tags:

python

regex

I am new to Python and was wondering if there is a better way to match all forms of URLs that might be found in a given string. After googling, there seem to be a lot of solutions that extract domains, replace them with links, etc., but none that removes them from a string. I have included an example below for reference. Thanks!

thestring = 'this is some text that will have one form or the other url embedded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error Log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Prem Minister asked Dec 29 '12 11:12



1 Answer

Include an encoding declaration at the top of your source file (the regex string contains non-ASCII symbols like »), e.g.:

# -*- coding: utf-8 -*-
import re
...

Also wrap your regex string in triple quotes (''' or """) instead of single quotes, since the string itself already contains both quote characters (' and ").

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
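Putting both suggestions together, a minimal sketch might look like the following. The names `URL_PATTERN` and `remove_urls` and the sample sentence are hypothetical, chosen only for illustration; the pattern itself is the one quoted above. The sketch assumes the file is saved as UTF-8 and is written to run on Python 3 as well as 2.7 (on Python 3 the encoding line is harmless, since UTF-8 is already the default).

```python
# -*- coding: utf-8 -*-
# The declaration above is what Python 2 needs, because the pattern below
# contains non-ASCII characters. Python 3 already assumes UTF-8.
import re

# Triple-quoted raw string, so the ' and " inside the character class
# do not terminate the literal. Pattern copied from the answer above.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')


def remove_urls(text):
    """Return text with every match of URL_PATTERN deleted (hypothetical helper)."""
    return URL_PATTERN.sub('', text)


# Sample input made up for this sketch.
sample = 'see http://www.google.com and www.domain.co.uk now'
print('==' + remove_urls(sample) + '==')
```

Note that only the URL itself is deleted, so the whitespace on either side of it stays in the result; collapse repeated spaces afterwards if that matters for your use case.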
kerim answered Sep 20 '22 03:09