
Python Arabic encoding issue

I have text in Windows-1256 encoding, and I want to convert it from Arabic (Windows-1256) to UTF-8.

Sample text:

Óæí Ïæã ÈíåÞí

Expected result:

سوي دوم بيهقي

I use this code to decode the text and encode it to UTF-8:

# -*- coding: utf-8 -*-

data = "Óæí Ïæã ÈíåÞí"
print data.decode("windows-1256", "replace")
print data.encode("windows-1256")

That code returns this result:

أ“أ¦أ­ أڈأ¦أ£ أˆأ­أ¥أ‍أ­
Traceback (most recent call last):
  File "mohmal2.py", line 5, in <module>
    print data.encode("windows-1256")
  File "/usr/lib/python2.7/encodings/cp1256.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I found a site that can convert this text:

http://www.iosart.com

Amir Mohsen asked Apr 19 '17 13:04



People also ask

How do I fix encoding in Python?

The best way to attack the problem, as with many things in Python, is to be explicit. That means that every string that your code handles needs to be clearly treated as either Unicode or a byte sequence. The most systematic way to accomplish this is to make your code into a Unicode-only clean room.
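For the conversion asked about here, the "clean room" idea can be sketched as: decode bytes once at the input boundary, work only with Unicode text in between, and encode once at the output boundary. This is a minimal sketch, assuming the data lives in a file that really is Windows-1256 encoded; the function name and the file paths are hypothetical, not from the question.

```python
import io


def convert_cp1256_to_utf8(src_path, dst_path):
    """Re-encode a Windows-1256 text file as UTF-8 (hypothetical helper)."""
    # Input boundary: bytes are decoded to Unicode text exactly once.
    with io.open(src_path, "r", encoding="cp1256") as src:
        text = src.read()

    # Everything between the boundaries handles Unicode text only.

    # Output boundary: Unicode text is encoded to bytes exactly once.
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text
```

`io.open` accepts the `encoding` argument on both Python 2 and Python 3, so the same sketch should work on either.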

Can UTF-8 handle Arabic characters?

UTF-8 can store the full Unicode range, so it's fine to use for Arabic.

What is encoding UTF-8 in Python?

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
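Because UTF-8 uses 8-bit code units, a single character may take more than one byte. A small illustration, using the first letter of the sample text from the question (the variable names are only for this example):

```python
# ARABIC LETTER SEEN (U+0633), the first letter of the expected result.
seen = u"\u0633"

utf8_bytes = seen.encode("utf-8")      # two bytes in UTF-8
cp1256_bytes = seen.encode("cp1256")   # one byte in Windows-1256

# ASCII characters stay one byte long in UTF-8.
ascii_bytes = u"A".encode("utf-8")
```

Here `utf8_bytes` is `b'\xd8\xb3'`, while `cp1256_bytes` is the single byte `b'\xd3'`, which is the same byte that Windows-1252 displays as `Ó`.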


2 Answers

It looks like you have accidentally decoded the input as Windows-1252.

>>> "Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
'سوي دوم بيهقي'
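The round trip above can be wrapped in a small helper. This is only a sketch for Python 3, assuming the garbled text was produced by decoding Windows-1256 bytes as Windows-1252; the function name and its `intended` parameter are hypothetical.

```python
def fix_cp1252_mojibake(garbled, intended="cp1256"):
    """Recover text whose bytes were wrongly decoded as Windows-1252."""
    # Undo the wrong decode: turn the mojibake back into the original bytes.
    raw = garbled.encode("cp1252")
    # Decode those bytes with the encoding they were actually written in.
    return raw.decode(intended)


fixed = fix_cp1252_mojibake("Óæí Ïæã ÈíåÞí")
```

If the garbled string contains characters that Windows-1252 cannot represent, `encode` raises `UnicodeEncodeError`, which usually means the text was garbled in some other way.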
Josh Lee answered Oct 03 '22 02:10



To add to @josh-lee's answer, here is the Python 2 case.
If you are using Python 2, add the unicode prefix u to the string literal.

>>> u"Óæí Ïæã ÈíåÞí".encode('cp1252').decode('cp1256')
u'\u0633\u0648\u064a \u062f\u0648\u0645 \u0628\u064a\u0647\u0642\u064a'
>>> print _
سوي دوم بيهقي