
efficiently replace bad characters

I often work with UTF-8 text containing characters like:

\xc2\x99

\xc2\x95

\xc2\x85

etc

These characters confuse other libraries I work with, so they need to be replaced.

What is an efficient way to do this, rather than:

text.replace('\xc2\x99', ' ').replace('\xc2\x85', '...')
hoju asked Jul 07 '11 11:07



2 Answers

There is always regular expressions; just list all of the offending characters inside square brackets like so:

import re
# Python 2: str is a byte string, so the class matches the single bytes \xc2 and \x99
print re.sub(r'[\xc2\x99]', " ", "Hello\xc2There\x99")

This prints: 'Hello There ', with the unwanted characters replaced by spaces.
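One caveat may be worth illustrating: a character class over byte strings matches each byte on its own, not the two-byte sequence. The sketch below is written for Python 3 with `bytes` patterns (an assumption, since the snippet above is Python 2), and the sample data is made up. It compares the class with an alternation of whole sequences.

```python
import re

# Made-up sample: UTF-8 text containing an encoded C1 control (U+0099)
# and a legitimate copyright sign (U+00A9, encoded as 0xC2 0xA9).
data = "Hello\u0099There \u00a9".encode("utf-8")

# Character class: matches the single bytes 0xC2 and 0x99 wherever they occur,
# so it also strips the lead byte of the copyright sign.
by_class = re.sub(rb"[\xc2\x99]", b" ", data)

# Alternation of whole sequences: only the listed two-byte sequences match.
by_sequence = re.sub(rb"\xc2\x99|\xc2\x95|\xc2\x85", b" ", data)

print(by_class)                     # no longer valid UTF-8
print(by_sequence.decode("utf-8"))  # still decodes cleanly
```

If the data can contain other characters whose encoding starts with `\xc2`, the alternation form is probably the safer of the two.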

Alternatively, if you need a different replacement for each character:

import re

# remove annoying characters
chars = {
    '\xc2\x82' : ',',        # High code comma
    '\xc2\x84' : ',,',       # High code double comma
    '\xc2\x85' : '...',      # Triple dot
    '\xc2\x88' : '^',        # High caret
    '\xc2\x91' : '\x27',     # Forward single quote
    '\xc2\x92' : '\x27',     # Reverse single quote
    '\xc2\x93' : '\x22',     # Forward double quote
    '\xc2\x94' : '\x22',     # Reverse double quote
    '\xc2\x95' : ' ',
    '\xc2\x96' : '-',        # High hyphen
    '\xc2\x97' : '--',       # Double hyphen
    '\xc2\x99' : ' ',
    '\xc2\xa0' : ' ',
    '\xc2\xa6' : '|',        # Split vertical bar
    '\xc2\xab' : '<<',       # Double less than
    '\xc2\xbb' : '>>',       # Double greater than
    '\xc2\xbc' : '1/4',      # one quarter
    '\xc2\xbd' : '1/2',      # one half
    '\xc2\xbe' : '3/4',      # three quarters
    '\xca\xbf' : '\x27',     # c-single quote
    '\xcc\xa8' : '',         # modifier - under curve
    '\xcc\xb1' : ''          # modifier - under line
}

def replace_chars(match):
    # Look up the replacement for the matched byte sequence.
    char = match.group(0)
    return chars[char]

# `text` is the byte string to clean (a Python 2 str)
text = re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
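On Python 3, where text is usually decoded to `str` first, the same per-character mapping could be expressed with `str.translate`, which does the replacement in a single pass. This is only a sketch: the `REPLACEMENTS` table and the `clean_text` name are hypothetical, and the table holds just a handful of the entries from the dictionary above, keyed by code point instead of by UTF-8 byte sequence.

```python
# Hypothetical table: code point -> replacement string.
REPLACEMENTS = {
    0x0085: "...",  # C1 control, ellipsis in Windows-1252
    0x0091: "'",    # C1 control, left single quote in Windows-1252
    0x0092: "'",    # C1 control, right single quote in Windows-1252
    0x0093: '"',    # C1 control, left double quote in Windows-1252
    0x0094: '"',    # C1 control, right double quote in Windows-1252
    0x0095: " ",    # C1 control, bullet in Windows-1252
    0x0099: " ",    # C1 control, trade mark sign in Windows-1252
    0x00A0: " ",    # no-break space
    0x00BD: "1/2",  # one half
}


def clean_text(text):
    """Return text with every mapped code point replaced in one pass."""
    return text.translate(REPLACEMENTS)


# Made-up sample bytes, decoded before cleaning.
raw = b"caf\xc3\xa9 \xc2\x93quoted\xc2\x94\xc2\x85"
print(clean_text(raw.decode("utf-8")))
```

Because the table is keyed by code point, legitimate characters such as the `é` in the sample pass through untouched.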
Nate answered Sep 22 '22 13:09


I think that there is an underlying problem here, and it might be a good idea to investigate and maybe solve it, rather than just trying to cover up the symptoms.

\xc2\x95 is the UTF-8 encoding of the character U+0095, which is a C1 control character (MESSAGE WAITING). It is not surprising that your library cannot handle it. But the question is, how did it get into your data?

Well, one very likely possibility is that it started out as the character 0x95 (BULLET) in the Windows-1252 encoding, was wrongly decoded as U+0095 instead of the correct U+2022, and then encoded into UTF-8. (The Japanese term mojibake describes this kind of mistake.)
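The chain of events described above can be reproduced in a few lines. This is a sketch in Python 3 that assumes the wrong decoding step was Latin-1 (ISO-8859-1), which maps every byte to the code point with the same number; the actual culprit in any given pipeline is unknown.

```python
# Byte 0x95 is BULLET in Windows-1252.
original = b"\x95"

# Assumed wrong step: decoding as Latin-1 turns byte 0x95 into the
# C1 control character U+0095 instead of U+2022.
wrong = original.decode("latin-1")

# Encoding the wrongly decoded text as UTF-8 yields the bytes from the question.
mojibake = wrong.encode("utf-8")
print(mojibake)  # b'\xc2\x95'
```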

If this is correct, then you can recover the original characters by putting them back into Windows-1252 and then decoding them into Unicode correctly this time. (In these examples I am using Python 3.3; these operations are a bit different in Python 2.)

>>> b'\x95'.decode('windows-1252')
'\u2022'
>>> import unicodedata
>>> unicodedata.name(_)
'BULLET'

If you want to do this correction for all the characters in the range 0x80–0x99 that are valid Windows-1252 characters, you can use this approach:

def restore_windows_1252_characters(s):
    """Replace C1 control characters in the Unicode string s by the
    characters at the corresponding code points in Windows-1252,
    where possible.

    """
    import re
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, s)

For example:

>>> restore_windows_1252_characters('\x95\x99\x85')
'•™…'
Gareth Rees answered Sep 21 '22 13:09