During web scraping, after stripping out all HTML tags, I was left with the black telephone character \u260e (☎). Unlike in this response, I do want to get rid of it too.
I used the following regular expression in Scrapy to eliminate HTML tags:
pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)
Then I tried to match \u260e and I think I got caught by the backslash plague. I unsuccessfully tried these patterns:
pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)
None of these worked and I still have \u260e in the output. How can I make it disappear?
Using str.replace(), you can replace a specific character. To remove that character, replace it with an empty string. The str.replace() method replaces all occurrences of the given character.
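A minimal sketch of that approach, assuming Python 3 (where every str is a unicode string) and a made-up sample text:

```python
# Assumption: Python 3, so plain string literals are unicode.
# The sample text is invented for illustration.
text = "call us \u260e 555-0100"

# Replace every occurrence of the black telephone with an empty string.
cleaned = text.replace("\u260e", "")

print(repr(cleaned))
```

Note that the spaces on both sides of the removed character stay in place, so a follow-up whitespace cleanup may be wanted.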
Use the str.encode() method to encode the string using the ASCII encoding. Set the errors argument to 'ignore', so all non-ASCII characters are dropped.
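As a rough illustration, again assuming Python 3 and an invented sample string, the encode step can be followed by a decode to get a str back:

```python
# Assumption: Python 3. The sample text is invented for illustration.
text = "bla ble \u260e blo caf\u00e9"

# Encoding to ASCII with errors="ignore" silently drops every character
# that has no ASCII representation; decoding turns the bytes back into str.
ascii_only = text.encode("ascii", errors="ignore").decode("ascii")

print(repr(ascii_only))
```

This is a blunt tool: it removes the ☎ but also the accented é, so it only fits when losing all non-ASCII text is acceptable.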
The str.isalnum() method checks whether every character in a string is a letter or a digit, and this property can be used to filter out special characters. The replace() method can then be used to replace special characters with an empty string.
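One way this could look, as a sketch assuming Python 3; keeping whitespace via isspace() is a choice made here, not something the text above prescribes:

```python
# Assumption: Python 3. The sample text is invented for illustration.
text = "bla ble \u260e blo!"

# Keep letters, digits and whitespace; drop symbols such as the
# black telephone and punctuation such as "!".
kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())

print(repr(kept))
```

Unlike the ASCII approach, isalnum() is true for non-ASCII letters as well, so accented characters survive this filter.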
Using Python 2.7.3, the following works fine for me:
import re
pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)
Output:
u'bla ble  blo'
As pointed out by @Zack, this works because the string is now unicode, i.e., the escape has already been converted, and the sequence of characters \u260e is now the little black phone ☎ itself (a single character in a unicode string) rather than a backslash followed by five more characters.
Once both the string to be searched and the regular expression contain the black phone itself, and not the sequence of characters \u260e, they match.
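The difference between the character and its escape sequence can be checked directly. The sketch below assumes Python 3, where string literals are unicode without the u prefix; the variable names and the second sample text are made up for illustration:

```python
import re

phone = "\u260e"      # one character: the black telephone itself
literal = r"\u260e"   # six characters: backslash, u, 2, 6, 0, e

print(len(phone), len(literal))

# Pattern and text both hold the real character, so they match.
s = "bla ble \u260e blo"
without_phone = re.sub(phone, "", s)
print(repr(without_phone))

# If the scraped text holds the escape as literal characters instead,
# the backslash itself has to be escaped in the pattern.
escaped_text = "bla ble \\u260e blo"
without_escape = re.sub(r"\\u260e", "", escaped_text)
print(repr(without_escape))
```

Printing repr() of the scraped string is a quick way to tell which of the two cases applies before choosing a pattern.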
If your string is already unicode, there are two easy ways. The second one will affect more than just the ☎, obviously.
>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'
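Here string.printable is the set of ASCII characters considered printable: digits, letters, punctuation, and whitespace.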
You may try with BeautifulSoup, as explained here, with something like
soup = BeautifulSoup(html.decode('utf-8', 'ignore'))