efficiently replace bad characters

I often work with utf-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc

These characters confuse other libraries I work with so need to be replaced.

What is an efficient way to do this, rather than:

text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')

asked Jul 07 '11 11:07

2 Answers

There are always regular expressions; just list all of the offending characters inside square brackets, like so:

import re
print re.sub(r'[\xc2\x99]', " ", "Hello\xc2There\x99")

This prints: 'Hello There ', with the unwanted characters replaced by spaces.
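One caveat worth checking: a character class matches any single listed byte, so [\xc2\x99] treats \xc2 and \x99 independently rather than as the two-byte sequence from the question. The following is a minimal Python 3 sketch of the difference, using a bytes pattern; the sample input is made up for illustration.

```python
import re

data = b"Hello\xc2\x99There"  # made-up sample containing the two-byte sequence

# A character class treats \xc2 and \x99 as two independent bytes,
# so each one is replaced by its own space (two spaces in total).
per_byte = re.sub(rb"[\xc2\x99]", b" ", data)

# A non-capturing group matches the two-byte sequence as a single unit,
# so the whole sequence becomes one space.
per_sequence = re.sub(rb"(?:\xc2\x99)", b" ", data)
```

If the sequences should be replaced as units, the grouped form (or the alternation used in the next example) is probably the safer choice.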

Alternatively, if you have a different replacement for each character:

import re

# remove annoying characters
chars = {
    '\xc2\x82' : ',',        # High code comma
    '\xc2\x84' : ',,',       # High code double comma
    '\xc2\x85' : '...',      # Triple dot
    '\xc2\x88' : '^',        # High caret
    '\xc2\x91' : '\x27',     # Forward single quote
    '\xc2\x92' : '\x27',     # Reverse single quote
    '\xc2\x93' : '\x22',     # Forward double quote
    '\xc2\x94' : '\x22',     # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',        # High hyphen
    '\xc2\x97' : '--',       # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',        # Split vertical bar
    '\xc2\xab' : '<<',       # Double less than
    '\xc2\xbb' : '>>',       # Double greater than
    '\xc2\xbc' : '1/4',      # one quarter
    '\xc2\xbd' : '1/2',      # one half
    '\xc2\xbe' : '3/4',      # three quarters
    '\xca\xbf' : '\x27',     # c-single quote
    '\xcc\xa8' : '',         # modifier - under curve
    '\xcc\xb1' : ''          # modifier - under line
}

def replace_chars(match):
    # Look up the replacement for whichever sequence matched.
    char = match.group(0)
    return chars[char]

# Build one alternation of all the sequences and substitute in a single pass.
text = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
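In Python 3, where text is normally handled as decoded str, a similar table can be applied with str.translate instead of a regular expression. This is a different technique from the one above, shown only as a sketch: the `replacements` table is a hypothetical subset of the mapping, keyed by code point rather than by UTF-8 byte sequence, and the sample input is made up.

```python
# Hypothetical subset of the table above, keyed by code point instead of
# UTF-8 byte sequence (U+0085 is what b'\xc2\x85' decodes to, and so on).
replacements = {
    0x85: "...",   # triple dot
    0x95: " ",
    0x99: " ",
    0xA0: " ",
    0xBD: "1/2",   # one half
}

def clean(text):
    """Return text with each mapped code point replaced by its string."""
    return text.translate(replacements)

raw = b"Wait\xc2\x85 \xc2\xbd cup"       # made-up UTF-8 input
cleaned = clean(raw.decode("utf-8"))     # "Wait... 1/2 cup"
```

str.translate does the whole substitution in one pass over the string, which fits the "efficient" part of the question, but it only works on single code points, not on multi-character sequences.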

answered Sep 22 '22 13:09

Nate

I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.

\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?

Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
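The suspected chain of events can be reproduced in a few lines of Python 3. This sketch assumes the wrong decoding step used Latin-1 (ISO-8859-1), which maps every byte to the code point of the same number; the actual codec involved in the asker's data is not known.

```python
original = b"\x95"  # BULLET in Windows-1252

# Assumed wrong step: decoding as Latin-1 maps byte 0x95 straight to U+0095.
misdecoded = original.decode("latin-1")

# Encoding that control character as UTF-8 yields the bytes from the question.
as_utf8 = misdecoded.encode("utf-8")

# What a correct Windows-1252 decode would have produced instead.
correct = original.decode("windows-1252")
```

Here `as_utf8` is b'\xc2\x95', matching the question, while `correct` is U+2022.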

If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)

>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'

If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:

def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.

    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)

For example:

>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
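Since the question starts from UTF-8 bytes rather than a Unicode string, an end-to-end run might look like the sketch below: decode the bytes first, then apply the repair. The function is repeated in compact form so the block runs on its own, and the sample bytes are made up.

```python
import re

def restore_windows_1252_characters(s):
    # Same idea as the function above, repeated so this sketch is self-contained.
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode("windows-1252")
        except UnicodeDecodeError:
            # No Windows-1252 character at this code point: drop it.
            return ""
    return re.sub(r"[\u0080-\u0099]", to_windows_1252, s)

raw = b"Item \xc2\x95 Brand\xc2\x99 and more\xc2\x85"  # made-up sample bytes
text = raw.decode("utf-8")        # now contains U+0095, U+0099, U+0085
fixed = restore_windows_1252_characters(text)
```

After this, `fixed` contains a bullet, a trade mark sign and a horizontal ellipsis in place of the three control characters.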

answered Sep 21 '22 13:09

Gareth Rees

hoju