I'm writing a web crawler in Python, and it involves taking headlines from websites.
One of the headlines should've read: And the Hip's coming, too
But instead it said: And the Hip’s coming, too
What's going wrong here?
decode() is a method of strings in Python 2. It converts a byte string from the encoding named in its argument into a unicode string, which makes it the inverse of encode().
The decode() function converts an encoded string back into its original form. It takes the encoded string as input and returns the original string. Like encode(), it accepts an errors parameter that controls how failures during decoding are handled.
Using the string encode() method, you can convert unicode strings into any encoding supported by Python. When no encoding is given, Python 3 defaults to UTF-8, while Python 2 defaults to ASCII.
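As a rough illustration of the encode()/decode() round trip and the errors parameter described above, here is a minimal sketch. It is written for Python 3, where str plays the role of Python 2's unicode and bytes plays the role of the plain string; the sample word is just an assumption chosen for the example.

```python
# Python 3 sketch: str is unicode text, bytes is encoded data.
text = "caf\u00e9"  # "café", written with an escape to keep the source ASCII-only

# The same text yields different bytes under different encodings.
utf8_bytes = text.encode("utf-8")            # b'caf\xc3\xa9'
cp1252_bytes = text.encode("windows-1252")   # b'caf\xe9'

# decode() reverses encode() when the same encoding name is used.
round_trip = utf8_bytes.decode("utf-8")

# The errors parameter controls what happens when a character or byte
# cannot be handled: "replace" substitutes a placeholder instead of raising.
ascii_bytes = text.encode("ascii", errors="replace")   # b'caf?'
lossy_text = utf8_bytes.decode("ascii", errors="replace")
```

Decoding with the wrong encoding name does not always raise an error, which is how text ends up silently mangled.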
In a different sense of "decoding" (a string puzzle rather than a character encoding), start by writing the first character of the encoded string and removing it from the encoded string. Then repeatedly take the first remaining character and add it to the decoded string, first on the left and then on the right, alternating until the encoded string is empty.
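The procedure above can be read as the following small sketch. The function name decode_alternating is hypothetical, and the left-then-right alternation is one interpretation of the description, not a confirmed reference implementation.

```python
def decode_alternating(encoded):
    """Rebuild a string by alternately prepending and appending characters.

    The first character seeds the result; each following character is added
    to the left, then the right, then the left again, and so on.
    """
    decoded = encoded[:1]
    for index, char in enumerate(encoded[1:]):
        if index % 2 == 0:
            decoded = char + decoded   # add to the left
        else:
            decoded = decoded + char   # add to the right
    return decoded


result = decode_alternating("abcde")
```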
It's an encoding error: the UTF-8 bytes of the headline were decoded as windows-1252 somewhere along the way. If it's a unicode string, this ought to fix it:
text.encode("windows-1252").decode("utf-8")
If it's a plain string, you'll need an extra step:
text.decode("utf-8").encode("windows-1252").decode("utf-8")
Both of these will give you a unicode string.
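For readers on Python 3, the same repair can be sketched as below. The helper name fix_mojibake is hypothetical, and the sketch assumes the text really is UTF-8 that was misread as windows-1252; other kinds of damage will raise an error or produce different garbage.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as windows-1252.

    Accepts either str (Python 2's unicode) or bytes (Python 2's plain string).
    """
    if isinstance(text, bytes):
        # The extra step for a plain string: get to unicode text first.
        text = text.decode("utf-8")
    # Re-create the original bytes, then decode them with the right encoding.
    return text.encode("windows-1252").decode("utf-8")


# The mangled headline, with the three stray characters written as escapes.
mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixed = fix_mojibake(mangled)
print(fixed)
```

The three characters in the mangled headline are the windows-1252 readings of the bytes E2 80 99, which together are the UTF-8 encoding of the right single quotation mark (U+2019).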
By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:
>>> import chardet
>>> chardet.detect(u"And the Hip’s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}
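If chardet isn't installed, a cruder alternative is to try a few candidate encodings and keep the ones whose round trip succeeds. This is a different technique from chardet's statistical detection, shown here only as a sketch in Python 3; the function name find_mojibake_fixes and the list of candidates are assumptions for illustration.

```python
def find_mojibake_fixes(text, candidates=("windows-1252", "latin-1", "cp437")):
    """Return (encoding, repaired_text) pairs for candidates that round-trip.

    A candidate qualifies when the text can be encoded with it and the
    resulting bytes are valid UTF-8 that differs from the input.
    """
    fixes = []
    for encoding in candidates:
        try:
            repaired = text.encode(encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this candidate cannot explain the damage
        if repaired != text:
            fixes.append((encoding, repaired))
    return fixes


mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixes = find_mojibake_fixes(mangled)
```

For the headline in the question, only windows-1252 survives, because the euro sign in the mangled text has no mapping in latin-1 or cp437.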