I was trying to read a text file with the following Python code:
with open(file, 'r') as myfile:
    data = myfile.read()
I got some weird characters that start with \x. What do they stand for, and how can I get rid of them when reading a text file?
For example:
...... \xc2\xa0 \xc2\xa0 chapter 1 tuesday 1984 \xe2\x80\x9chey , jake , your mom sent me to pick you up \xe2\x80\x9d jacob robbins knew better than to accept a ride from a stranger , but when his mom\xe2\x80\x99s friend ronny was waiting for him in front of school he reluctantly got in the car \xe2\x80\x9cmy name is jacob........
That's UTF-8 encoded text. Open the file as UTF-8 so the bytes are decoded into characters:
with open(file, 'r', encoding='utf-8') as myfile:
    ...
In Python 2.x, use codecs.open instead:
import codecs
with codecs.open(file, 'r', encoding='utf-8') as myfile:
    ...
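To see the difference the encoding argument makes, here is a minimal Python 3 sketch. The sample bytes are copied from the question; the temporary file and the variable names are only for illustration and are not part of the original code.

```python
import os
import tempfile

# Sample bytes from the question: a non-breaking space and curly quotes in UTF-8.
raw = b"\xc2\xa0 chapter 1 \xe2\x80\x9chey , jake \xe2\x80\x9d mom\xe2\x80\x99s friend"

# Write the bytes to a temporary file (stand-in for the real text file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode returns the undecoded bytes, which display as \x.. escapes.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with an explicit encoding returns decoded characters.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()
finally:
    os.remove(path)

print(repr(as_bytes))  # shows the \x.. escapes
print(as_text)         # shows the actual quotes and spacing
```

The decoded string contains the real characters (curly quotes, apostrophe, non-breaking space) instead of their byte escapes.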
Further reading: Unicode In Python, Completely Demystified
Those are string escapes. They represent a character by its hexadecimal value. For example, \x24 is 0x24, which is the dollar sign.
>>> '\x24'
'$'
>>> chr(0x24)
'$'
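One such escape (from the ones you provided) is \xc2. Read on its own as Latin-1, it is Â, a capital A with a circumflex. In your text, however, it is the first byte of the two-byte UTF-8 sequence \xc2\xa0, which encodes a non-breaking space. Likewise, \xe2\x80\x9c and \xe2\x80\x9d are the left and right curly double quotes, and \xe2\x80\x99 is a curly apostrophe.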
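A rough Python 3 sketch of how those byte sequences map to characters, and one possible way to replace them with plain ASCII if you really want them gone. The ASCII_MAP table is a hypothetical choice for this example, not a standard mapping; extend it as needed.

```python
import unicodedata

# A few of the byte sequences from the question.
raw = b"\xc2\xa0 \xe2\x80\x9chey\xe2\x80\x9d mom\xe2\x80\x99s"

# Latin-1 maps every byte to one character, which is where the stray "Â" comes from.
latin1_view = raw[:2].decode("latin-1")

# UTF-8 reads the same bytes as multi-byte sequences, one character each.
text = raw.decode("utf-8")
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]

# Optional cleanup: translate typographic characters to ASCII equivalents.
# ASCII_MAP is an illustrative, hand-picked table.
ASCII_MAP = str.maketrans({
    "\u00a0": " ",   # non-breaking space
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2019": "'",   # curly apostrophe
})
cleaned = text.translate(ASCII_MAP)

print(names)
print(cleaned)
```

Decoding is usually enough; the translation step is only useful when downstream code expects plain ASCII punctuation.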