Make Python replace un-encodable characters with a string by default

I want Python to handle characters it can't encode by replacing each one with the string "<could not encode>" instead of raising an error.

E.g., assuming the default encoding is ASCII, the expression

'%s is the word'%'ébác'

would yield

'<could not encode>b<could not encode>c is the word'

Is there any way to make this the default behavior across my whole project?

olamundo asked Dec 19 '09 15:12



2 Answers

The str.encode method takes an optional argument that defines the error handling:

str.encode([encoding[, errors]])

From the docs:

Return an encoded version of the string. Default encoding is the current default string encoding. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Codec Base Classes. For a list of possible encodings, see section Standard Encodings.
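As a rough illustration of the built-in handlers listed in that quote, here is a minimal sketch. It assumes Python 3, where str is Unicode text and encode returns bytes, rather than the Python 2 the quoted documentation describes. The sample text is the one from the question, written with escapes so the source file's own encoding does not matter.

```python
# Python 3 sketch (an assumption; the thread itself dates from Python 2).
text = "\u00e9b\u00e1c is the word"  # "ébác is the word"

ignored = text.encode("ascii", "ignore")              # drops the bad characters
replaced = text.encode("ascii", "replace")            # substitutes "?"
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # numeric XML character references
escaped = text.encode("ascii", "backslashreplace")    # backslash escape sequences

for result in (ignored, replaced, xml_refs, escaped):
    print(result)
```

None of these built-in handlers inserts a custom string, which is why a handler registered through codecs.register_error is needed for the "<could not encode>" behavior.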

In your case, the codecs.register_error function might be of interest.

[Note about bad chars]

By the way, when using register_error, note that unless you pay attention you will likely end up replacing whole groups of consecutive bad characters with your string, rather than each individual bad character. The error handler is called once per run of bad characters, not once per character.
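One way to get a replacement per character is to repeat the placeholder once for every position in the failing range. The following is a minimal sketch assuming Python 3. The handler name "could_not_encode" and the PLACEHOLDER constant are made up for this example and are not part of the codecs module.

```python
import codecs

PLACEHOLDER = "<could not encode>"  # hypothetical name, chosen for this sketch


def could_not_encode(exc):
    """Replace every un-encodable character with PLACEHOLDER."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc  # this sketch only handles encoding errors
    # exc.start and exc.end delimit the failing range, which may span
    # several consecutive characters, so repeat the placeholder for each one.
    bad_run_length = exc.end - exc.start
    # Return the replacement text and the position at which to resume encoding.
    return (PLACEHOLDER * bad_run_length, exc.end)


# "could_not_encode" is a hypothetical handler name.
codecs.register_error("could_not_encode", could_not_encode)

single = "\u00e9b\u00e1c is the word".encode("ascii", "could_not_encode")
run = "\u00e9\u00e1bc".encode("ascii", "could_not_encode")
print(single)  # b'<could not encode>b<could not encode>c is the word'
print(run)     # b'<could not encode><could not encode>bc'
```

The handler still has to be named on each encode call. Registering it makes the name available but does not change the default 'strict' behavior.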

miku answered Nov 07 '22 02:11


>>> help("".encode)
Help on built-in function encode:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

The same errors argument works for decode. So, for instance, in a Python 2 session:

>>> x
'\xc3\xa9b\xc3\xa1c is the word'
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> x.decode("ascii", "replace")
u'\ufffd\ufffdb\ufffd\ufffdc is the word'

Add your own callback to codecs.register_error to replace with the string of your choice.
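A callback for the decoding direction shown in the session above might look like the following sketch. It assumes Python 3, where the input is a bytes object. The handler name "could_not_decode" and its placeholder string are hypothetical and chosen only for illustration.

```python
import codecs


def could_not_decode(exc):
    """Replace every undecodable byte with a fixed placeholder string."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    # One placeholder per byte in the failing range [exc.start, exc.end).
    replacement = "<could not decode>" * (exc.end - exc.start)
    # Return the replacement text and the position at which to resume decoding.
    return (replacement, exc.end)


# "could_not_decode" is a hypothetical handler name.
codecs.register_error("could_not_decode", could_not_decode)

raw = b"\xc3\xa9b\xc3\xa1c is the word"  # the UTF-8 bytes from the session above
decoded = raw.decode("ascii", "could_not_decode")
print(decoded)
```

Because the sample bytes are UTF-8, each accented letter occupies two bytes, so the placeholder appears twice per letter, just as u'\ufffd' does in the 'replace' output above.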

J.J. answered Nov 07 '22 03:11