 

Decoding UTF-8 strings in Python


I'm writing a web crawler in Python, and it involves extracting headlines from websites.

One of the headlines should have read: And the Hip’s coming, too

But instead it said: And the Hip’s coming, too

What's going wrong here?

user1624005 asked Oct 28 '12 16:10





1 Answer

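It's an encoding error: the headline's UTF-8 bytes were decoded as windows-1252 somewhere along the way, so the three bytes of the curly apostrophe (E2 80 99) show up as ’. If it's a unicode string, re-encoding it as windows-1252 and decoding the result as UTF-8 ought to fix it: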

text.encode("windows-1252").decode("utf-8") 

If it's a plain (byte) string, as in Python 2, you'll need an extra step to decode it first:

text.decode("utf-8").encode("windows-1252").decode("utf-8") 

Both of these will give you a unicode string.
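
For anyone on Python 3, where str is always unicode and has no decode() method, the first form is the one that applies. Here is a minimal sketch of that round trip, assuming the text really was UTF-8 mis-decoded as windows-1252; the helper name fix_mojibake is made up for illustration and is not part of any library:

```python
def fix_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes that were wrongly decoded as windows-1252.

    Hypothetical helper for illustration only. If the round trip fails,
    the text was probably not mangled this way, so it is returned as-is.
    """
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The mangled headline, written with escapes so the source stays ASCII:
# U+00E2 U+20AC U+2122 is the sequence that displays as the garbled apostrophe.
mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
fixed = fix_mojibake(mangled)

# ascii() avoids console encoding problems when printing the result.
print(ascii(fixed))
```

The try/except is a design choice rather than a requirement: without it, text that was never mangled (for example a correctly decoded "café") would raise an error instead of passing through unchanged.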

By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hip’s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}
Zero Piraeus answered Oct 17 '22 20:10
