I have text that is encoded in Windows-1256 (Arabic), and I want to convert it to UTF-8.
Sample text:
Óæí Ïæã ÈíåÞí
Expected result:
سوي دوم بيهقي
I use this code to decode the text and encode it to UTF-8:
# -*- coding: utf-8 -*-
data = "Óæí Ïæã ÈíåÞí"
print data.decode("windows-1256", "replace")
print data.encode("windows-1256")
That code returns this result:
أ“أ¦أ أڈأ¦أ£ أˆأأ¥أأ
Traceback (most recent call last):
File "mohmal2.py", line 5, in <module>
print data.encode("windows-1256")
File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
I found a site that can convert this text:
http://www.iosart.com
The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.
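A minimal Python 3 sketch of that clean-room idea, assuming the input arrives as Windows-1256 bytes: decode once at the input boundary, handle only str inside the program, and encode once at the output boundary. The function names read_text and write_text are hypothetical and chosen only for illustration.

```python
def read_text(raw_bytes, encoding="cp1256"):
    """Input boundary: turn raw bytes into a Unicode str."""
    return raw_bytes.decode(encoding)


def write_text(text, encoding="utf-8"):
    """Output boundary: turn a Unicode str back into bytes."""
    return text.encode(encoding)


# The sample phrase as Windows-1256 bytes (assumed input for this sketch).
raw = b"\xd3\xe6\xed \xcf\xe6\xe3 \xc8\xed\xe5\xde\xed"

text = read_text(raw)          # str from here on, no bytes inside the program
utf8_bytes = write_text(text)  # bytes again only when leaving the program

print(ascii(text))
print(utf8_bytes)
```

Everything between the two boundary calls deals with str only, so there is no point at which Python has to guess an encoding.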
UTF-8 can store the full Unicode range, so it's fine to use for Arabic.
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
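As a small illustration in Python 3, UTF-8 uses a variable number of 8-bit units per character: an ASCII letter takes one byte, while an Arabic letter such as SEEN (U+0633) takes two. The variable names are only for this example.

```python
seen = "\u0633"  # ARABIC LETTER SEEN, the first letter of the sample text

ascii_bytes = "A".encode("utf-8")
seen_bytes = seen.encode("utf-8")

print(len(ascii_bytes), ascii_bytes)  # 1 b'A'
print(len(seen_bytes), seen_bytes)    # 2 b'\xd8\xb3'
```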
It looks like you have accidentally decoded the input as Windows-1252.
>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
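The same round trip can be wrapped in a small helper. This is only a sketch for Python 3, assuming the text really was mis-decoded as Windows-1252; fix_mojibake is a hypothetical name, not a library function.

```python
def fix_mojibake(garbled, wrong="cp1252", right="cp1256"):
    """Undo a wrong decode: recover the original bytes, then decode them correctly."""
    original_bytes = garbled.encode(wrong)  # back to the bytes that were misread
    return original_bytes.decode(right)     # decode them with the intended codec


# The garbled sample, written with escapes; it displays as "Óæí Ïæã ÈíåÞí".
garbled = "\xd3\xe6\xed \xcf\xe6\xe3 \xc8\xed\xe5\xde\xed"

fixed = fix_mojibake(garbled)
print(ascii(fixed))
```

If the garbled text contains a character that has no Windows-1252 byte, encode raises UnicodeEncodeError, which is a hint that the wrong codec was assumed.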
I would like to add the Python 2 case to @josh-lee's answer.
If you are using Python 2, add the unicode prefix u to the string literal.
>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي