I've struggled with encodings for far too long, and today I want to break the mental block wide open.
Right now, I'm using Requests to scrape a bunch of websites, and from what I can tell it uses the HTTP headers to figure out the encoding each page is using, falling back to chardet when the site's headers are missing. From there, it decodes the bytes it downloads and helpfully hands me a unicode object in r.text.
All good.
Where I get confused is the next step: I do some work on the text and then print it to stdout, providing an encoding when I print:
print foo.encode('utf-8')
The problem is that when I do that, the thing that's printed is messed up. In the following, I expect to get an emdash between the word 'judgments' and 'Standard':
Declaratory judgmentsStandard of review.
Instead, I get the boxy thing with the four tiny numbers in it. It doesn't seem to show up here, of course, but I think the numbers are 0097, which corresponds to what I get if I do:
repr(foo)
u'Declaratory judgments\x97Standard of review.'
So that kind of makes sense, but where's my emdash?
The process boils down to:
Where's the problem? This sounds like the mythical unicode sandwich to me, but clearly I'm missing something.
You are doing something odd. \x97 is an emdash in the cp1252 encoding. In a Unicode string, it's U+0097 END OF GUARDED AREA. Somehow, your cp1252 bytes are being turned into Unicode code points one-for-one (which is what a Latin-1 decode does) instead of being decoded as cp1252. Show more of the code that got you to this state, and we can dig deeper.
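To make the mismatch concrete, here is a minimal sketch of one way to end up in this state. It assumes Python 3 (the question uses Python 2, where the same codecs behave the same way) and it assumes the page was served as cp1252 but decoded as Latin-1; the byte string is made up from the example in the question, and the variable names are only for illustration.

```python
# Hypothetical raw bytes as a server might send them: 0x97 is an em dash in cp1252.
raw = b"Declaratory judgments\x97Standard of review."

# Latin-1 maps every byte straight to the code point with the same number,
# so 0x97 becomes U+0097 (a C1 control character, not a dash).
wrong = raw.decode("latin-1")

# cp1252 maps 0x97 to U+2014 EM DASH, which is what the page author meant.
right = raw.decode("cp1252")

print(ascii(wrong))  # 'Declaratory judgments\x97Standard of review.'
print(ascii(right))  # 'Declaratory judgments\u2014Standard of review.'

# A string that was already mis-decoded as Latin-1 can be repaired by
# reversing that step and decoding again with the correct codec.
repaired = wrong.encode("latin-1").decode("cp1252")
assert repaired == right

# Encoding the correctly decoded text to UTF-8 yields the em dash bytes.
print(right.encode("utf-8"))
```

If this is what is happening in your scraper, one thing to try is setting r.encoding = 'cp1252' (or r.encoding = r.apparent_encoding) before reading r.text. Requests documents that it assumes ISO-8859-1 for text responses whose Content-Type header carries no charset, which would produce exactly this symptom, but without seeing your code that is only a guess.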
PS: the Unicode sandwich is hardly mythical, it is an ideal to strive for! :)
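For reference, the sandwich looks roughly like this: decode bytes as soon as they come in, work only on text in the middle, and encode once on the way out. This is a sketch in Python 3; the function name process and its whitespace cleanup step are hypothetical stand-ins for whatever work you do on the text.

```python
def process(raw_bytes, source_encoding):
    """Decode at the edge, work on text, encode at the other edge."""
    # Bottom slice of bread: bytes -> text, using the real source encoding.
    text = raw_bytes.decode(source_encoding)

    # The filling: all processing happens on text, never on bytes.
    # (Hypothetical step: collapse runs of whitespace.)
    text = " ".join(text.split())

    # Top slice of bread: text -> bytes, once, right before output.
    return text.encode("utf-8")


out = process(b"Declaratory  judgments\x97Standard of review.", "cp1252")
print(out)
```

The sandwich only works if the bottom slice uses the right codec. If the decode step picks the wrong encoding, everything after it operates on the wrong characters, and encoding to UTF-8 at the end faithfully preserves the mistake.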