
Encoding Arabic letters with their diacritics (if present)

I'm working on a deep learning project in which we use an RNN. I want to encode the data before it's fed to the network. The input is Arabic verses, whose diacritics are treated as separate characters in Python. If a character is followed by a diacritic, I need to encode the two together as a single number; otherwise, I encode the character alone.

Since I'm doing this for millions of verses, I was hoping to use a lambda with map. However, I cannot iterate over two characters at a time, i.e., I was hoping for something like:

map(lambda ch, next_ch: encode(ch + next_ch) if is_diacritic(next_ch) else encode(ch), verse)

My intention behind this question is to find the fastest way to accomplish the above. There is no restriction to lambda functions, but answers relying on for loops are not what I'm looking for.

As a close analogue for non-Arabic speakers, assume you want to encode the following text:

 XXA)L_I!I%M<LLL>MMQ*Q

You want to encode each letter concatenated with the character that follows it if that character is a special character; otherwise, you encode the letter alone.

Output:

['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

For Arabic speakers:

Verse example:

"قفا نبك من ذِكرى حبيب ومنزل بسِقطِ اللّوى بينَ الدَّخول فحَوْمل"

Diacritics are the small symbols above or below a letter (e.g., ّ and ْ).


[Update]

The diacritics range from 0x64B (1611) to 0x652 (1618).

The letters range from 0x621 (1569) to 0x63A (1594), and from 0x641 (1601) to 0x64A (1610).

A letter can have at most one diacritic.
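
For reference, a minimal sketch of those range checks in Python (the helper names here are illustrative, not part of the original code):

def is_diacritic(ch):
    # 0x064B-0x0652 per the ranges stated above
    return 0x064B <= ord(ch) <= 0x0652

def is_letter(ch):
    # 0x0621-0x063A and 0x0641-0x064A per the ranges stated above
    return 0x0621 <= ord(ch) <= 0x063A or 0x0641 <= ord(ch) <= 0x064A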


Extra information:

A similar encoding methodology to what I'm doing is representing the binary form of the verse as a matrix with shape (number of bits needed, number of characters in a verse). Both the number of bits and number of characters are calculated after we combine each letter with its diacritic if it exists.

For example, assume the verse is the following, and diacritics are special characters:

X+Y_XX+YYYY_

The different letter combinations in the alphabet are:

['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']  

Hence I need at least 3 bits to represent these 6 characters, so the number of bits needed is 3.
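
As a small illustrative check (not part of the encoding itself), the bit count follows directly from the alphabet size:

import math

alphabet = ['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']
bits_needed = math.ceil(math.log2(len(alphabet)))  # ceil(log2(6)) == 3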

Consider the following encodings:

{
'X' : '000',
'X+': '001',
'X_': '010',
'Y' : '011',
'Y+': '100',
'Y_': '101',
}

And the example is represented in a matrix as follows (each binary code is written vertically):

X+     Y_    X    X+    Y    Y    Y    Y_
0      1     0    0     0    0    0    1
0      0     0    0     1    1    1    0
1      1     0    1     1    1    1    1

Which is why I am looking to combine diacritics with letters first.
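
To make that concrete, here is a minimal sketch of the matrix construction with numpy, assuming the tokens have already been combined and reusing the codes from the table above:

import numpy as np

codes = {'X': 0, 'X+': 1, 'X_': 2, 'Y': 3, 'Y+': 4, 'Y_': 5}
tokens = ['X+', 'Y_', 'X', 'X+', 'Y', 'Y', 'Y', 'Y_']
bits = 3

# One column per token, one row per bit (most significant bit on top).
matrix = np.array([[(codes[t] >> b) & 1 for t in tokens]
                   for b in reversed(range(bits))])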


Note: the answers to "Iterate over a string 2 (or n) characters at a time in Python" and "Iterating each character in a string using Python" do not give the intended result.

asked Jan 23 '19 by ndrwnaguib



2 Answers

I'm going to throw my hat into the ring with numpy here. You can convert a string into a usable format with:

arr = np.array([verse]).view(np.uint32)

You can build a mask of the locations where the following character is a diacritic:

mask = np.empty(arr.shape, dtype=bool)
np.bitwise_and((arr[1:] > lower), (arr[1:] < upper), out=mask[:-1])
mask[-1] = False

Here, the range (lower, upper) is a made-up stand-in for the actual diacritic check; implement it however you like. In this example, I used the full-blown form of bitwise_and with empty to avoid a potentially expensive append of the last element.

Now if you have a numerical method for encoding your code points to a number, which I'm sure you can vectorize, you can do something like:

combined = combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]])

To get the remaining, uncombined characters, you have to remove both the diacritics and the characters they bind to. The easiest way I can think of to do this is to smear the mask to the right and negate it. Again, I assume that you have a vectorized method to encode the single characters as well:

smeared = mask.copy()
smeared[1:] |= mask[:-1]
single = encode(arr[~smeared])

Combining the result into a final array is conceptually simple but takes a couple of steps. The result will be np.count_nonzero(mask) elements shorter than the input, since the diacritics are removed. Each combined element therefore needs to shift left by the number of diacritics that precede it. Here's one way to do it:

ind = np.flatnonzero(mask)
nnz = ind.size
ind -= np.arange(nnz)

output = np.empty(arr.size - nnz, dtype='U1')
output[ind] = combined

# mask of unmodified elements
out_mask = np.ones(output.size, dtype=bool)
out_mask[ind] = False
output[out_mask] = single

The reason I'm suggesting numpy is that it should be able to handle a few million characters in a matter of seconds this way. Getting the output back as a string should be straightforward.

Suggested Implementation

I have been pondering your question and decided to play with some timings and possible implementations. My idea was to map the Unicode characters in 0x0621-0x063A and 0x0641-0x064A (26 + 10 = 36 letters) into the lower 6 bits of a uint16, and the characters 0x064B-0x0652 (8 diacritics) into the bits above (values 1-8, with 0 meaning no diacritic), assuming these are in fact the only diacritics you need:

def encode_py(char):
    # Map letters 0x0621-0x063A to 0-25 and 0x0641-0x064A to 27-36.
    char = ord(char) - 0x0621
    if char >= 0x20:
        char -= 5
    return char

def combine_py(char, diacritic):
    # Letter code in the low 6 bits, diacritic code (1-8) in the bits above.
    return encode_py(char) | ((ord(diacritic) - 0x064A) << 6)
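
For instance, the letter beh (U+0628) followed by a damma (U+064F) packs into a single integer, with letter code 7 in the low bits and diacritic code 5 above them:

>>> combine_py('\u0628', '\u064F')  # 7 | (5 << 6)
327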

In numpy terms:

def encode_numpy(chars):
    chars = chars - 0x0621
    return np.subtract(chars, 5, where=chars > 0x20, out=chars)

def combine_numpy(chars, diacritics):
    chars = encode_numpy(chars)
    chars |= (diacritics - 0x064A) << 6
    return chars

You can choose to encode further to shorten the representation slightly, but I would not recommend it. This representation has the advantage of being verse-independent, so you can compare portions of different verses, as well as not worry about which representation you're going to get depending on how many verses you encoded together. You can even mask out the top bits of all the codes to compare the raw characters, without diacritics.
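
For example, comparing raw characters boils down to a single mask (assuming encoded is an array of these packed codes):

raw_letters = encoded & 0x3F  # drop the diacritic bits, keep the letter code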

So let's say that your verse is a collection of randomly generated numbers in those ranges, with diacritics randomly generated to follow one letter each at most. We can generate a string of around a million characters pretty easily for comparative purposes:

import random

random.seed(0xB00B5)

alphabet = list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
diacritics = list(range(0x064B, 0x0653))

alphabet = [chr(x) for x in alphabet]
diacritics = [chr(x) for x in diacritics]

def sample(n=1000000, d=0.25):
    while n:
        yield random.choice(alphabet)
        n -= 1
        if n and random.random() < d:
            yield random.choice(diacritics)
            n -= 1

data = ''.join(sample())

This data has completely randomly distributed characters, with approximately a 25% chance of any character being followed by a diacritic. It takes just a few seconds to generate on my not-too-overpowered laptop.

The numpy conversion would look like this:

def convert_numpy(verse):
    arr = np.array([verse]).view(np.uint32)
    mask = np.empty(arr.shape, dtype=bool)
    mask[:-1] = (arr[1:] >= 0x064B)
    mask[-1] = False

    combined = combine_numpy(chars=arr[mask], diacritics=arr[1:][mask[:-1]])

    smeared = mask.copy()
    smeared[1:] |= mask[:-1]
    single = encode_numpy(arr[~smeared])

    ind = np.flatnonzero(mask)
    nnz = ind.size
    ind -= np.arange(nnz)

    output = np.empty(arr.size - nnz, dtype=np.uint16)
    output[ind] = combined

    # mask of unmodified elements
    out_mask = np.ones(output.size, dtype=bool)
    out_mask[ind] = False
    output[out_mask] = single

    return output

Benchmarks

And now let's %timeit to see how it goes. First, here are the other implementations. I convert everything to a numpy array or a list of integers for fair comparison. I've also made minor modifications to make the functions return lists of the same quantities to validate accuracy:

from itertools import tee, zip_longest
from functools import reduce

def is_diacritic(c):
    return ord(c) >= 0x064B

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ... 
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def combine_py2(char, diacritic):
    return char | ((ord(diacritic) - 0x064A) << 6)

def convert_FHTMitchell(verse):
    def convert(verse):
        was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice

        # fillvalue will not be encoded but ensures last char is read
        for this_char, next_char in pairwise(verse, fillvalue='-'):
            if was_diacritic:  # last next_char (so this_char) is diacritic
                was_diacritic = False
            elif is_diacritic(next_char):
                yield combine_py(this_char, next_char)
                was_diacritic = True
            else:
                yield encode_py(this_char)

    return list(convert(verse))

def convert_tobias_k_1(verse):
    return reduce(lambda lst, x: lst + [encode_py(x)] if not is_diacritic(x) else lst[:-1] + [combine_py2(lst[-1], x)], verse, [])

def convert_tobias_k_2(verse):
    res = []
    for x in verse:
        if not is_diacritic(x):
            res.append(encode_py(x))
        else:
            res[-1] = combine_py2(res[-1], x)
    return res

def convert_tobias_k_3(verse):
    return [combine_py(x, y) if y and is_diacritic(y) else encode_py(x) for x, y in zip_longest(verse, verse[1:], fillvalue="") if not is_diacritic(x)]

Now for the timings:

%timeit result_FHTMitchell = convert_FHTMitchell(data)
338 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_1 = convert_tobias_k_1(data)
Aborted, took > 5min to run. Appears to scale quadratically with input size: not OK!

%timeit result_tobias_k_2 = convert_tobias_k_2(data)
357 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_3 = convert_tobias_k_3(data)
466 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_numpy = convert_numpy(data)
30.2 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

A comparison of the resulting arrays/lists shows that they are equal as well:

np.array_equal(result_FHTMitchell, result_tobias_k_2)  # True
np.array_equal(result_tobias_k_2, result_tobias_k_3)   # True
np.array_equal(result_tobias_k_3, result_numpy)        # True

I'm using array_equal here because it performs all the necessary type conversions to verify the actual data.

So the moral of the story is that there are lots of ways to do this, and parsing a few million characters shouldn't be prohibitively expensive on its own, until you get into cross-referencing and other truly time-consuming tasks. The main thing to take from this is not to use reduce on lists, since you will be reallocating a lot more than you need to. Even a simple for loop will work fine for your purposes. And while numpy is about ten times faster than the other implementations, it doesn't give a huge advantage.

Decoding

For the sake of completeness, here is a function to decode your results:

def decode(arr):
    mask = (arr > 0x3F)                          # entries carrying a diacritic in the high bits
    nnz = np.count_nonzero(mask)
    ind = np.flatnonzero(mask) + np.arange(nnz)  # output positions, shifted right for reinserted diacritics

    diacritics = (arr[mask] >> 6) + 41           # 41 == 0x64A - 0x621, undone by the final += 0x0621
    characters = (arr & 0x3F)
    characters[characters >= 27] += 5            # undo the gap between the two letter ranges

    output = np.empty(arr.size + nnz, dtype='U1').view(np.uint32)
    output[ind] = characters[mask]
    output[ind + 1] = diacritics

    output_mask = np.zeros(output.size, dtype=bool)
    output_mask[ind] = output_mask[ind + 1] = True
    output[~output_mask] = characters[~mask]

    output += 0x0621                             # shift all codes back into the Arabic block

    return output.base.view(f'U{output.size}').item()
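
Assuming data was generated as above, a quick round-trip sanity check would look like:

>>> decode(convert_numpy(data)) == data
True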

As a side note, the work I did here inspired this question: Converting numpy arrays of code points to and from strings

answered Sep 22 '22 by Mad Physicist


map does not seem to be the right tool for the job. You do not want to map characters to other characters, but group them together. Instead, you might try reduce (or functools.reduce in Python 3). Here, I use isalpha to test what kind of character it is; you might need something else.

>>> is_diacritic = lambda x: not x.isalpha()
>>> verse = "XXA)L_I!I%M<LLL>MMQ*Q"
>>> reduce(lambda lst, x: lst + [x] if not is_diacritic(x) else lst[:-1] + [lst[-1]+x], verse, [])
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

However, this is barely readable and also creates lots of intermediate lists. Better just use a boring old for loop, even if you explicitly asked for something else:

res = []
for x in verse:
    if not is_diacritic(x):
        res.append(x)
    else:
        res[-1] += x

By iterating over pairs of consecutive characters, e.g. using zip(verse, verse[1:]) (i.e., (1,2), (2,3), ..., not (1,2), (3,4), ...), you could indeed also use a list comprehension, but I'd still vote for the for loop for readability.

>>> from itertools import zip_longest
>>> [x + y if is_diacritic(y) else x
...  for x, y in zip_longest(verse, verse[1:], fillvalue="")
...  if not is_diacritic(x)]
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

You could even do the same using map and lambda, but you'd also need to filter first, with another lambda, making the whole thing orders of magnitude uglier and harder to read.
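
For illustration only, a sketch of that map-plus-filter variant (same verse and is_diacritic as above), which indeed reads much worse:

>>> list(map(lambda p: p[0] + p[1] if is_diacritic(p[1]) else p[0],
...          filter(lambda p: not is_diacritic(p[0]),
...                 zip_longest(verse, verse[1:], fillvalue=""))))
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']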

answered Sep 23 '22 by tobias_k