
python replace unicode characters

I wrote a program to read a Windows DNS debugging log, but the domain field always contains some funny characters.

Below is one example:

'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

I want to replace every \x.. sequence with a ?

Explicitly typing \xc2 as follows works:

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it's not working if I write it as follows:

re.sub('\\\x..', '?', line)

How can I write a regular expression to replace them all?

kenneth171 asked Nov 08 '22 08:11



1 Answer

There are better tools for this job than regex. For example, in Python 2 you could try:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

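That skips non-ASCII characters. With 'replace' instead, each non-ASCII byte becomes the Unicode replacement character U+FFFD (displayed as �), which a further .replace(u'\ufffd', u'?') would turn into a literal ?: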

>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
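
For readers on Python 3, where str has no .decode method, the same idea can be sketched roughly as follows. This assumes the log line is available as bytes (for instance, read from a file opened in binary mode); the variable names are illustrative only. A bytes regex is included as one possible way to do what the question literally asked for.

```python
import re

# Assumption: the log line was read as raw bytes (e.g. open(path, 'rb')).
line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# Drop every non-ASCII byte.
ignored = line.decode('ascii', 'ignore')

# Turn every non-ASCII byte into U+FFFD, then into a literal '?'.
replaced = line.decode('ascii', 'replace').replace('\ufffd', '?')

# Regex alternative working directly on bytes:
# match any single byte outside the ASCII range 0x00-0x7f.
regexed = re.sub(rb'[^\x00-\x7f]', b'?', line)

print(ignored)
print(replaced)
print(regexed)
```

Note that the regex version returns bytes, while the two decode-based versions return str.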

But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages.
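
As a rough illustration of what such a recovery might look like in Python 3, the sketch below assumes (without confirmation from the log itself) that the original bytes were wrongly decoded as Latin-1 and then re-encoded as UTF-8. The candidate code pages and the helper name try_codecs are hypothetical choices for the example, not a diagnosis of this particular log.

```python
line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# Step 1: the bytes happen to be valid UTF-8, so decode them as such.
text = line.decode('utf-8')

# Step 2 (assumption): undo a suspected Latin-1 mis-decoding,
# which maps each code point below U+0100 back to a single byte.
raw = text.encode('latin-1')


def try_codecs(data, codecs):
    """Hypothetical helper: decode data with each codec, None on failure."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# Step 3: candidate code pages are guesses; inspect the output by eye.
candidates = ('cp1252', 'big5', 'gbk', 'shift_jis')
for codec, decoded in try_codecs(raw, candidates).items():
    print(codec, repr(decoded))
```

Whether any candidate yields a sensible domain name has to be judged manually, and none of them may be right.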

There is an excellent answer about unbaking mojibake here. Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.

wim answered Nov 14 '22 21:11
