I'm writing a web crawler in Python, and it involves taking headlines from websites. One of the headlines should have read:

And the Hip's coming, too

But instead it said:

And the Hipâ€™s coming, too

What's going wrong here?


Decoding UTF-8 strings in Python

1 Answer

It's an encoding error: the text was encoded as UTF-8 but then decoded as windows-1252. If it's a unicode string (unicode in Python 2, str in Python 3), this ought to fix it:

text.encode("windows-1252").decode("utf-8")

If it's a plain byte string (str in Python 2, bytes in Python 3), you'll need an extra step:

text.decode("utf-8").encode("windows-1252").decode("utf-8")

Both of these will give you a unicode string.
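As a rough illustration of the round trip in Python 3, the sketch below reproduces the mangling and then reverses it. The helper name fix_mojibake is made up for this example, and it assumes the text was mangled exactly once, by reading UTF-8 bytes as windows-1252:

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as windows-1252.

    Hypothetical helper: returns the input unchanged when the round trip
    fails, which usually means the text was not mangled this way.
    """
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the problem: encode as UTF-8, then decode with the wrong codec.
original = "And the Hip\u2019s coming, too"
mangled = original.encode("utf-8").decode("windows-1252")

# Reverse it with the same expression used in the answer.
fixed = fix_mojibake(mangled)
print(mangled)
print(fixed)
```

Falling back to the original text on failure is a design choice for crawlers, where most headlines are probably fine and only a few need repair.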

By the way - to discover how a piece of text like this has been mangled due to encoding issues, you can use chardet:

>>> import chardet
>>> chardet.detect(u"And the Hipâ€™s coming, too")
{'confidence': 0.5, 'encoding': 'windows-1252'}
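If chardet isn't installed, a cruder standard-library check can answer the same question. This is not what chardet does internally. It is a brute-force sketch that re-encodes the text with each codec from a short, assumed candidate list and keeps the ones that yield valid UTF-8. The names CANDIDATES and guess_misdecoding are invented for this example:

```python
# Assumed short list of codecs that commonly cause this kind of mangling.
CANDIDATES = ["windows-1252", "latin-1", "cp437"]


def guess_misdecoding(text, candidates=CANDIDATES):
    """Return (codec, repaired_text) pairs for codecs that explain the mangling.

    A codec is a hit when re-encoding the text with it produces bytes that
    are valid UTF-8 and the decoded result differs from the input.
    """
    hits = []
    for codec in candidates:
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the mangled text
        if repaired != text:
            hits.append((codec, repaired))
    return hits


mangled = "And the Hip\u00e2\u20ac\u2122s coming, too"
for codec, repaired in guess_misdecoding(mangled):
    print(codec, "->", repaired)
```

For the headline above, only windows-1252 survives, because the euro sign and trademark sign in the mangled text cannot be encoded in latin-1 or cp437.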

answered Oct 17 '22 20:10

Zero Piraeus



user1624005
