How to eliminate the ☎ unicode?

During web scraping, after stripping all HTML tags, I was left with the black telephone character \u260e (☎) in the output. Unlike in this response, I do want to get rid of it as well.

I used the following regular expression in Scrapy to eliminate HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried these patterns without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still get \u260e in the output. How can I make it disappear?

asked May 06 '13 15:05

rafa

3 Answers

Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

As pointed out by @Zack, this works because both the pattern and the string are now unicode objects. In a u"..." literal, Python converts the escape \u260e into the single character ☎ (code point U+260E); in a plain Python 2 byte-string literal the same text stays as six separate characters (\, u, 2, 6, 0, e), which is not what the scraped text contains.

Once both the string being searched and the regular expression contain the black phone itself, rather than the character sequence \u260e, they match.
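
To make the difference concrete, here is a minimal sketch. The names `escaped`, `phone`, `TAG_OR_PHONE` and `clean` are made up for illustration and are not from the question. A raw string stands in for the unconverted escape so that the snippet behaves the same on Python 2.7 and Python 3:

```python
import re

# Raw literal: the escape is NOT converted, so this is six characters.
escaped = r"\u260e"
# Unicode literal: the escape becomes the single character U+260E.
phone = u"\u260e"

print(len(escaped))  # 6
print(len(phone))    # 1

# Same alternation as in the answer, compiled from a unicode literal.
TAG_OR_PHONE = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL)


def clean(text):
    """Remove tags, the two entities, and the phone character from unicode text."""
    return TAG_OR_PHONE.sub(u"", text)


print(clean(u"<b>call</b> \u260e&nbsp;555"))  # call 555
```

The pattern only matches the phone when `text` is unicode as well, which is why the example string is also written with the `u` prefix.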

answered Nov 09 '22 16:11

Rubens

If your string is already unicode, there are two easy ways. The second one removes every character that is not in string.printable, so it affects more than just the ☎.

>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'

See "Remove non-ascii characters but leave periods and spaces" for more information about string.printable, and "The SHORTEST way to remove multiple spaces in a string in Python" if you don't want the leftover multiple whitespaces.
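
As a rough sketch of how the two approaches differ and how the leftover double space could be collapsed afterwards, assuming the input is already unicode. The helper names `strip_non_printable` and `collapse_whitespace` are hypothetical, and the split/join idiom is just one common way to squeeze whitespace:

```python
import string


def strip_non_printable(text):
    """Keep only characters found in string.printable (ASCII only)."""
    return u"".join(ch for ch in text if ch in string.printable)


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return u" ".join(text.split())


foo = u"Lorum \u260e Ipsum"

# Targeted removal, then squeeze the double space left behind.
print(collapse_whitespace(foo.replace(u"\u260e", u"")))  # Lorum Ipsum

# The printable filter also drops other non-ASCII characters such as e-acute.
print(repr(strip_non_printable(u"caf\u00e9 \u260e")))
```

The last line shows the side effect mentioned above: the accented letter disappears along with the phone.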

answered Nov 09 '22 17:11

timss

You may try BeautifulSoup, as explained here, with something like

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
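
The decoding step can be tried on its own with the standard library. This is only a sketch under the assumption that the page arrives as UTF-8 bytes; `raw`, `text` and `PATTERN` are illustrative names, and BeautifulSoup itself is left out so the snippet has no third-party dependency:

```python
import re

# Assumed input: UTF-8 encoded bytes, where the phone is the three bytes E2 98 8E.
raw = b"<p>Call \xe2\x98\x8e 555</p>"

# Decode to unicode first; 'ignore' silently drops byte sequences that are not valid UTF-8.
text = raw.decode("utf-8", "ignore")

# Once the text is unicode, a unicode pattern can match the phone character.
PATTERN = re.compile(u"<.*?>|\u260e", re.DOTALL)
result = PATTERN.sub(u"", text)

print(repr(result))
```

Decoding does not remove the character by itself; it only turns the bytes into unicode text so that a replacement or a regular expression can match it.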

answered Nov 09 '22 16:11

kiriloff

Tags: python, regex, python-2.7, scrapy