Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast transliteration for Arabic Text with Python

Tags:

python

arabic

I always work on Arabic text files and to avoid problems with encoding I transliterate Arabic characters into English according to Buckwalter's scheme (http://www.qamus.org/transliteration.htm)

Here is my code to do so but it's very SLOW even with small files like 400 kb. Ideas to make it faster?

Thanks

     def transliterate(file):
          data = open(file).read()
          buckArab = {"'":"ء", "|":"آ", "?":"أ", "&":"ؤ", "<":"إ", "}":"ئ", "A":"ا", "b":"ب", "p":"ة", "t":"ت", "v":"ث", "g":"ج", "H":"ح", "x":"خ", "d":"د", "*":"ذ", "r":"ر", "z":"ز", "s":"س", "$":"ش", "S":"ص", "D":"ض", "T":"ط", "Z":"ظ", "E":"ع", "G":"غ", "_":"ـ", "f":"ف", "q":"ق", "k":"ك", "l":"ل", "m":"م", "n":"ن", "h":"ه", "w":"و", "Y":"ى", "y":"ي", "F":"ً", "N":"ٌ", "K":"ٍ", "~":"ّ", "o":"ْ", "u":"ُ", "a":"َ", "i":"ِ"}    
          for char in data: 
               for k, v in arabBuck.iteritems():
                     data = data.replace(k,v)                 
      return data
like image 785
Sabba Avatar asked Oct 06 '12 06:10

Sabba


1 Answers

Edit Oct 2021

There was a python package recently released that does this (and a lot more), so anyone reading this post now should ignore all the other answers and just use Camel Tools. (Nizar Habash and his team at NYU Abu Dhabi are awesome for developing this and making it so accessible!)

::python
from camel_tools.utils.charmap import CharMapper
sentence = "ذهبت إلى المكتبة."
print(sentence)

ar2bw = CharMapper.builtin_mapper('ar2bw')

sent_bw = ar2bw(sentence)
print(sent_bw)

Output:

هبت إلى المكتبة.
*hbt <lY Almktbp.

You can find install instructions and tutorials here: https://github.com/CAMeL-Lab/camel_tools


Old answer Incidentally, someone already wrote a script that does this, so you might want to check that out before spending too much time on your own: buckwalter2unicode.py

It probably does more than what you need, but you don't have to use all of it: I copied just the two dictionaries and the transliterateString function (with a few tweaks, I think), and use that on my site.

Edit: The script above is what I have been using, but I'm just discovered that it is much slower than using replace, especially for a large corpus. This is the code I finally ended up with, that seems to be simpler and faster (this references a dictionary buck2uni):

def transString(string, reverse=0):
    '''Given a Unicode string, transliterate into Buckwalter. To go from
    Buckwalter back to Unicode, set reverse=1'''

    for k, v in buck2uni.items():
        if not reverse:
            string = string.replace(v, k)
        else:
            string = string.replace(k, v)

    return string
like image 168
larapsodia Avatar answered Sep 18 '22 21:09

larapsodia