
How to eliminate the ☎ Unicode character?

During web scraping, after stripping all HTML tags, I was left with the black telephone character \u260e (☎). Unlike in this response, I do want to get rid of it as well.

I used the following regular expression in Scrapy to eliminate HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried the following patterns without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still have \u260e in the output. How can I make it disappear?

rafa asked May 06 '13 15:05



3 Answers

Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

As pointed out by @Zack, this works because the pattern is now a unicode literal: Python has already converted the escape sequence \u260e into the single code point U+260E, the little black phone ☎, instead of leaving it as a run of six separate characters.

Once both the string being searched and the regular expression contain the black phone itself, rather than the character sequence \u260e, the pattern matches.
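
The difference between the escape sequence and the character itself can be sketched as follows. This assumes Python 3, where every string literal is already Unicode (so no u prefix is needed), rather than the Python 2.7.3 used above; the helper name strip_markup and the sample input are made up for illustration.

```python
import re

# Python 3 sketch: "\u260e" in a normal literal is ONE code point (the phone).
phone = "\u260e"
# Doubling the backslash keeps the escape as text: \, u, 2, 6, 0, e.
escaped = "\\u260e"

# Same alternatives as in the question, with the real character appended.
TAG_OR_PHONE = re.compile("<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL)


def strip_markup(text):
    """Remove tags, the two entities, and the black telephone (hypothetical helper)."""
    return TAG_OR_PHONE.sub("", text)


cleaned = strip_markup("<b>call</b> \u260e&nbsp;now")

print(len(phone), len(escaped))  # 1 6
print(repr(cleaned))             # 'call now'
```

If the scraped text really contains the six literal characters \u260e instead of ☎, it was probably escaped somewhere upstream, and the pattern would have to match that literal text instead.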

Rubens answered Nov 09 '22 16:11


If your string is already unicode, there are two easy ways. The second one removes every character outside string.printable, not just the ☎.

>>> import string                                   
>>> foo = u"Lorum ☎ Ipsum"                          
>>> foo.replace(u'☎', '')                           
u'Lorum  Ipsum'                                     
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'      
  • See "Remove non-ascii characters but leave periods and spaces" for more information about string.printable.
  • See "The SHORTEST way to remove multiple spaces in a string in Python" if you don't want runs of multiple spaces left behind.
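
The two approaches above, plus the whitespace cleanup mentioned in the second link, might look like this as small functions. This is a sketch assuming Python 3 (the session above is Python 2); the function names are invented for illustration.

```python
import string


def drop_char(text, char="\u260e"):
    """Remove every occurrence of one specific character."""
    return text.replace(char, "")


def keep_printable_ascii(text):
    """Keep only characters listed in string.printable.

    This also drops accented letters and any other non-ASCII character.
    """
    allowed = set(string.printable)
    return "".join(ch for ch in text if ch in allowed)


def collapse_spaces(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


foo = "Lorum \u260e Ipsum"
print(repr(drop_char(foo)))                        # 'Lorum  Ipsum'
print(repr(keep_printable_ascii(foo)))             # 'Lorum  Ipsum'
print(repr(collapse_spaces(drop_char(foo))))       # 'Lorum Ipsum'
```

The first function is the safer choice when the rest of the text may contain legitimate non-ASCII characters.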
timss answered Nov 09 '22 17:11


You may try with BeautifulSoup, as explained here, with something like

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
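
As a side note on the decode call itself, here is a small standard-library sketch (Python 3 assumed, sample bytes made up) of what the 'ignore' error handler does. It only discards bytes that are not valid UTF-8, so a correctly encoded ☎ is likely to survive decoding and would still need to be removed afterwards.

```python
# UTF-8 bytes for U+260E are E2 98 8E; 0xFF can never appear in valid UTF-8.
raw = b"call \xe2\x98\x8e now \xff"

# 'ignore' silently drops the malformed 0xFF byte but keeps the phone.
decoded = raw.decode("utf-8", "ignore")

# The phone still has to be removed explicitly.
cleaned = decoded.replace("\u260e", "")

print(repr(decoded))
print(repr(cleaned))
```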
kiriloff answered Nov 09 '22 16:11