I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
etc
These characters confuse other libraries I work with so need to be replaced.
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85, '...')
To replace multiple characters in a string, chain multiple calls to the replaceAll() method, e.g. str. replaceAll('. ', '! ').
One can use str. replace() inside a loop to check for a bad_char and then replace it with the empty string hence removing it.
Unicode is the universal character encoding used to process, store and facilitate the interchange of text data in any language while ASCII is used for the representation of text such as symbols, letters, digits, etc. in computers. ASCII : It is a character encoding standard for electronic communication.
Method 2: Replace multiple characters using translate() + maketrans() There is also a dedication function that can perform this type of replacement task in a single line hence this is a recommended way to solve this particular problem.
There is always regular expressions; just list all of the offending characters inside square brackets like so:
import re print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
Alternately, if you have a different replacement character for each:
# remove annoying characters chars = { '\xc2\x82' : ',', # High code comma '\xc2\x84' : ',,', # High code double comma '\xc2\x85' : '...', # Tripple dot '\xc2\x88' : '^', # High carat '\xc2\x91' : '\x27', # Forward single quote '\xc2\x92' : '\x27', # Reverse single quote '\xc2\x93' : '\x22', # Forward double quote '\xc2\x94' : '\x22', # Reverse double quote '\xc2\x95' : ' ', '\xc2\x96' : '-', # High hyphen '\xc2\x97' : '--', # Double hyphen '\xc2\x99' : ' ', '\xc2\xa0' : ' ', '\xc2\xa6' : '|', # Split vertical bar '\xc2\xab' : '<<', # Double less than '\xc2\xbb' : '>>', # Double greater than '\xc2\xbc' : '1/4', # one quarter '\xc2\xbd' : '1/2', # one half '\xc2\xbe' : '3/4', # three quarters '\xca\xbf' : '\x27', # c-single quote '\xcc\xa8' : '', # modifier - under curve '\xcc\xb1' : '' # modifier - under line } def replace_chars(match): char = match.group(0) return chars[char] return re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.
\xc2\x95
is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?
Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)
>>> b'\x95'.decode('windows-1252') '\u2022' >>> import unicodedata >>> unicodedata.name(_) 'BULLET'
If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:
def restore_windows_1252_characters(s): """Replace C1 control characters in the Unicode string s by the characters at the corresponding code points in Windows-1252, where possible. """ import re def to_windows_1252(match): try: return bytes([ord(match.group(0))]).decode('windows-1252') except UnicodeDecodeError: # No character at the corresponding code point: remove it. return '' return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)
For example:
>>> restore_windows_1252_characters('\x95\x99\x85') '•™…'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With