
How to fix broken UTF-8 encoding in Python?

My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I found a site that can do this: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

So I tried to do the same in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

But this does not work: the original string is UTF-8, yet the string I get back is not the result I expect.

Note: the text is Vietnamese.

How can I resolve this? Is it a Windows encoding or something similar? How can I detect the encoding here?

giaosudau asked Oct 21 '14 16:10



1 Answer

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be easily installed with pip: pip install ftfy
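
For context, this kind of breakage usually appears when UTF-8 bytes are decoded with a single-byte codec such as Windows-1252. The sketch below reproduces and reverses that one case using only the Python 3 standard library. It is not how ftfy works internally; the helper name repair_mojibake and the choice of cp1252 as the wrong codec are assumptions made for illustration.

```python
def repair_mojibake(text, wrong_encoding="cp1252"):
    """Undo UTF-8 bytes that were mistakenly decoded with wrong_encoding.

    Hypothetical helper: returns the text unchanged when the round trip
    is not possible (for example, when the text was never broken).
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the breakage: UTF-8 bytes read as Windows-1252 (assumed cause).
original = "09. Bát Nhã Tâm Kinh"
broken = original.encode("utf-8").decode("cp1252")

print(broken)                   # 09. BÃ¡t NhÃ£ TÃ¢m Kinh
print(repair_mojibake(broken))  # 09. Bát Nhã Tâm Kinh
```

This only handles a single, known wrong codec. Python's strict cp1252 codec also has no mapping for a few bytes (0x81, for instance), so strings whose UTF-8 form contains those bytes cannot be round-tripped this way; those are the cases where a dedicated library is the more practical choice.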

Dima Rostopira answered Oct 18 '22 03:10
