I work with Arabic text files all the time, and to avoid encoding problems I transliterate the Arabic characters into Latin characters according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm).
Here is my code to do so, but it is very slow, even on small files of around 400 KB. Any ideas for making it faster?
Thanks
    def transliterate(file):
        data = open(file).read()
        # Keys are Buckwalter symbols, values are the Arabic characters
        buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}
        # For every character in the file, run every replacement over the whole text
        for char in data:
            for k, v in buckArab.items():
                # Arabic character (v) -> Buckwalter symbol (k)
                data = data.replace(v, k)
        return data
Edit Oct 2021
A Python package was recently released that does this (and a lot more), so anyone reading this post now should ignore all the other answers and just use Camel Tools. (Nizar Habash and his team at NYU Abu Dhabi are awesome for developing this and making it so accessible!)
::python
from camel_tools.utils.charmap import CharMapper
sentence = "ذهبت إلى المكتبة."
print(sentence)
ar2bw = CharMapper.builtin_mapper('ar2bw')
sent_bw = ar2bw(sentence)
print(sent_bw)
Output:
ذهبت إلى المكتبة.
*hbt <lY Almktbp.
You can find install instructions and tutorials here: https://github.com/CAMeL-Lab/camel_tools
Old answer
Incidentally, someone has already written a script that does this, so you might want to check it out before spending too much time on your own: buckwalter2unicode.py
It probably does more than what you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and use that on my site.
Edit: The script above is what I had been using, but I've just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, which seems to be simpler and faster (it references a dictionary, buck2uni, that maps Buckwalter symbols to Arabic characters):
    def transString(string, reverse=0):
        '''Given a Unicode string, transliterate into Buckwalter. To go from
        Buckwalter back to Unicode, set reverse=1'''
        for k, v in buck2uni.items():
            if not reverse:
                string = string.replace(v, k)
            else:
                string = string.replace(k, v)
        return string
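If the replace loop is still too slow, another option worth trying is `str.translate`, which walks the string once instead of once per dictionary entry. The following is only a sketch, assuming Python 3: the `buck2uni` dictionary here is a hypothetical subset of the full table (just enough for the sample sentence), and `trans_string` and the two table names are made up for this example.

```python
# Hypothetical subset of the Buckwalter table (Buckwalter symbol -> Arabic character).
# A real table would list every character in the scheme.
buck2uni = {
    "*": "ذ", "h": "ه", "b": "ب", "t": "ت", "<": "إ", "l": "ل",
    "Y": "ى", "A": "ا", "m": "م", "k": "ك", "p": "ة",
}

# Build both translation tables once, not on every call.
BUCK_TO_ARABIC = str.maketrans(buck2uni)
ARABIC_TO_BUCK = str.maketrans({uni: buck for buck, uni in buck2uni.items()})


def trans_string(text, reverse=False):
    """Arabic -> Buckwalter by default; Buckwalter -> Arabic if reverse is true."""
    table = BUCK_TO_ARABIC if reverse else ARABIC_TO_BUCK
    # Characters that are not in the table (spaces, punctuation) pass through unchanged.
    return text.translate(table)


print(trans_string("ذهبت إلى المكتبة."))
```

For the sample sentence this should print `*hbt <lY Almktbp.`, and calling `trans_string(..., reverse=True)` on that result should give the Arabic back. Because every character is looked up exactly once, the order of the dictionary entries does not matter, which is not true in general for chained replace calls.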