
Encoding Arabic letters with their diacritics (if present)

I'm working on a deep learning project in which we use an RNN. I want to encode the data before it's fed to the network. The input is Arabic verses, whose diacritics are treated as separate characters in Python. If a character is followed by a diacritic, I need to encode the two together as a single number; otherwise, I encode the character alone.

Since I'm doing this for millions of verses, I was hoping to use a lambda with map. However, I cannot iterate over two characters at a time, i.e., I was hoping for something like:

map(lambda ch, next_ch: encode(ch + next_ch) if is_diacritic(next_ch) else encode(ch), verse)

My intention behind this question is to find the fastest way to accomplish the above. There is no restriction to lambda functions, but answers relying on for loops are not what I'm looking for.

As a close analogue for non-Arabic speakers, assume you want to encode the following text:

 XXA)L_I!I%M<LLL>MMQ*Q

You want to encode each letter concatenated with the character that follows it if that character is a special character; otherwise, you encode the letter alone.

Output:

['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

For Arabic speakers:

Verse example:

"قفا نبك من ذِكرى حبيب ومنزل بسِقطِ اللّوى بينَ الدَّخول فحَوْمل"

Diacritics are the small symbols above or below a letter (e.g., ّ and ْ).


[Update]

The diacritics range from 0x64B (1611) to 0x652 (1618).

The letters range from 0x621 (1569) to 0x63A (1594), and from 0x641 (1601) to 0x64A (1610).

A letter can have at most one diacritic.
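
For reference, a minimal sketch of those range checks in Python (the helper names here are illustrative, not part of the original code):

def is_diacritic(ch):
    # 0x064B-0x0652 per the ranges stated above
    return 0x064B <= ord(ch) <= 0x0652

def is_letter(ch):
    # 0x0621-0x063A and 0x0641-0x064A per the ranges stated above
    return 0x0621 <= ord(ch) <= 0x063A or 0x0641 <= ord(ch) <= 0x064A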


Extra information:

A similar encoding methodology to what I'm doing is representing the binary form of the verse as a matrix with shape (number of bits needed, number of characters in a verse). Both the number of bits and number of characters are calculated after we combine each letter with its diacritic if it exists.

For example, assume the verse is the following, and diacritics are special characters:

X+Y_XX+YYYY_

The different letter combinations in the alphabet are:

['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']  

Hence I need at least 3 bits to represent these 6 characters, so the number of bits needed is 3.
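
As a small illustrative check (not part of the encoding itself), the bit count follows directly from the alphabet size:

import math

alphabet = ['X', 'X+', 'X_', 'Y', 'Y+', 'Y_']
bits_needed = math.ceil(math.log2(len(alphabet)))  # ceil(log2(6)) == 3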

Consider the following encodings:

{
'X' : '000',
'X+': '001',
'X_': '010',
'Y' : '011',
'Y+': '100',
'Y_': '101',
}

And the example is represented in a matrix as follows (each binary code is written vertically):

X+     Y_    X    X+    Y    Y    Y    Y_
0      1     0    0     0    0    0    1
0      0     0    0     1    1    1    0
1      1     0    1     1    1    1    1

Which is why I am looking to combine diacritics with letters first.
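
To make that concrete, here is a minimal sketch of the matrix construction with numpy, assuming the tokens have already been combined and reusing the codes from the table above:

import numpy as np

codes = {'X': 0, 'X+': 1, 'X_': 2, 'Y': 3, 'Y+': 4, 'Y_': 5}
tokens = ['X+', 'Y_', 'X', 'X+', 'Y', 'Y', 'Y', 'Y_']
bits = 3

# One column per token, one row per bit (most significant bit on top).
matrix = np.array([[(codes[t] >> b) & 1 for t in tokens]
                   for b in reversed(range(bits))])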


Note: the answers to "Iterate over a string 2 (or n) characters at a time in Python" and "Iterating each character in a string using Python" do not give the intended result.

asked Jan 23 '19 by ndrwnaguib



2 Answers

I'm going to throw my hat into the ring with numpy here. You can convert a string into a usable format with:

arr = np.array([verse]).view(np.uint32)

You can build a mask of the locations where the following character is a diacritic:

mask = np.empty(arr.shape, dtype=bool)
np.bitwise_and((arr[1:] > lower), (arr[1:] < upper), out=mask[:-1])
mask[-1] = False

Here, the range (lower, upper) is a made-up stand-in for the actual diacritic check; implement it however you like. In this example, I used the full-blown form of bitwise_and with empty to avoid a potentially expensive append of the last element.

Now if you have a numerical method for encoding your code points to a number, which I'm sure you can vectorize, you can do something like:

combined = combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]])

To get the remaining, uncombined characters, you have to remove both the diacritics and the characters they bind to. The easiest way I can think of to do this is to smear the mask to the right and negate it. Again, I assume that you have a vectorized method to encode the single characters as well:

smeared = mask.copy()
smeared[1:] |= mask[:-1]
single = encode(arr[~smeared])

Combining the result into a final array is conceptually simple but takes a couple of steps. The result will be np.count_nonzero(mask) elements shorter than the input, since the diacritics are removed. Each combined element therefore needs to shift left by the number of diacritics that precede it. Here's one way to do it:

ind = np.flatnonzero(mask)
nnz = ind.size
ind -= np.arange(nnz)

output = np.empty(arr.size - nnz, dtype='U1')
output[ind] = combined

# mask of unmodified elements
out_mask = np.ones(output.size, dtype=bool)
out_mask[ind] = False
output[out_mask] = single

The reason I'm suggesting numpy is that it should be able to handle a few million characters in a matter of seconds this way. Getting the output back as a string should be straightforward.

Suggested Implementation

I have been pondering your question and decided to play with some timings and possible implementations. My idea was to map the Unicode characters in 0x0621-0x063A and 0x0641-0x064A (26 + 10 = 36 letters) into the lower 6 bits of a uint16, and the characters 0x064B-0x0652 (8 diacritics) into the bits above (values 1-8, with 0 meaning no diacritic), assuming these are in fact the only diacritics you need:

def encode_py(char):
    # Map letters 0x0621-0x063A to 0-25 and 0x0641-0x064A to 27-36.
    char = ord(char) - 0x0621
    if char >= 0x20:
        char -= 5
    return char

def combine_py(char, diacritic):
    # Letter code in the low 6 bits, diacritic code (1-8) in the bits above.
    return encode_py(char) | ((ord(diacritic) - 0x064A) << 6)
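
For instance, the letter beh (U+0628) followed by a damma (U+064F) packs into a single integer, with letter code 7 in the low bits and diacritic code 5 above them:

>>> combine_py('\u0628', '\u064F')  # 7 | (5 << 6)
327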

In numpy terms:

def encode_numpy(chars):
    chars = chars - 0x0621
    return np.subtract(chars, 5, where=chars > 0x20, out=chars)

def combine_numpy(chars, diacritics):
    chars = encode_numpy(chars)
    chars |= (diacritics - 0x064A) << 6
    return chars

You can choose to encode further to shorten the representation slightly, but I would not recommend it. This representation has the advantage of being verse-independent, so you can compare portions of different verses, as well as not worry about which representation you're going to get depending on how many verses you encoded together. You can even mask out the top bits of all the codes to compare the raw characters, without diacritics.
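
For example, comparing raw characters boils down to a single mask (assuming encoded is an array of these packed codes):

raw_letters = encoded & 0x3F  # drop the diacritic bits, keep the letter code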

So let's say that your verse is a collection of randomly generated numbers in those ranges, with diacritics randomly generated to follow one letter each at most. We can generate a string of around a million characters pretty easily for comparative purposes:

import random

random.seed(0xB00B5)

alphabet = list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
diacritics = list(range(0x064B, 0x0653))

alphabet = [chr(x) for x in alphabet]
diacritics = [chr(x) for x in diacritics]

def sample(n=1000000, d=0.25):
    while n:
        yield random.choice(alphabet)
        n -= 1
        if n and random.random() < d:
            yield random.choice(diacritics)
            n -= 1

data = ''.join(sample())

This data has completely randomly distributed characters, with approximately a 25% chance of any character being followed by a diacritic. It takes just a few seconds to generate on my not-too-overpowered laptop.

The numpy conversion would look like this:

def convert_numpy(verse):
    arr = np.array([verse]).view(np.uint32)
    mask = np.empty(arr.shape, dtype=bool)
    mask[:-1] = (arr[1:] >= 0x064B)
    mask[-1] = False

    combined = combine_numpy(chars=arr[mask], diacritics=arr[1:][mask[:-1]])

    smeared = mask.copy()
    smeared[1:] |= mask[:-1]
    single = encode_numpy(arr[~smeared])

    ind = np.flatnonzero(mask)
    nnz = ind.size
    ind -= np.arange(nnz)

    output = np.empty(arr.size - nnz, dtype=np.uint16)
    output[ind] = combined

    # mask of unmodified elements
    out_mask = np.ones(output.size, dtype=bool)
    out_mask[ind] = False
    output[out_mask] = single

    return output

Benchmarks

And now let's %timeit to see how it goes. First, here are the other implementations. I convert everything to a numpy array or a list of integers for fair comparison. I've also made minor modifications to make the functions return lists of the same quantities to validate accuracy:

from itertools import tee, zip_longest
from functools import reduce

def is_diacritic(c):
    return ord(c) >= 0x064B

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ... 
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def combine_py2(char, diacritic):
    return char | ((ord(diacritic) - 0x064A) << 6)

def convert_FHTMitchell(verse):
    def convert(verse):
        was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice

        # fillvalue will not be encoded but ensures last char is read
        for this_char, next_char in pairwise(verse, fillvalue='-'):
            if was_diacritic:  # last next_char (so this_char) is diacritic
                was_diacritic = False
            elif is_diacritic(next_char):
                yield combine_py(this_char, next_char)
                was_diacritic = True
            else:
                yield encode_py(this_char)

    return list(convert(verse))

def convert_tobias_k_1(verse):
    return reduce(lambda lst, x: lst + [encode_py(x)] if not is_diacritic(x) else lst[:-1] + [combine_py2(lst[-1], x)], verse, [])

def convert_tobias_k_2(verse):
    res = []
    for x in verse:
        if not is_diacritic(x):
            res.append(encode_py(x))
        else:
            res[-1] = combine_py2(res[-1], x)
    return res

def convert_tobias_k_3(verse):
    return [combine_py(x, y) if y and is_diacritic(y) else encode_py(x) for x, y in zip_longest(verse, verse[1:], fillvalue="") if not is_diacritic(x)]

Now for the timings:

%timeit result_FHTMitchell = convert_FHTMitchell(data)
338 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_1 = convert_tobias_k_1(data)
Aborted, took > 5min to run. Appears to scale quadratically with input size: not OK!

%timeit result_tobias_k_2 = convert_tobias_k_2(data)
357 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_tobias_k_3 = convert_tobias_k_3(data)
466 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit result_numpy = convert_numpy(data)
30.2 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

A comparison of the resulting arrays/lists shows that they are equal as well:

np.array_equal(result_FHTMitchell, result_tobias_k_2)  # True
np.array_equal(result_tobias_k_2, result_tobias_k_3)   # True
np.array_equal(result_tobias_k_3, result_numpy)        # True

I'm using array_equal here because it performs all the necessary type conversions to verify the actual data.

So the moral of the story is that there are lots of ways to do this, and parsing a few million characters shouldn't be prohibitively expensive on its own, until you get into cross-referencing and other truly time-consuming tasks. The main thing to take from this is not to use reduce on lists, since you will be reallocating a lot more than you need to. Even a simple for loop will work fine for your purposes. And while numpy is about ten times faster than the other implementations, it doesn't give a huge advantage.

Decoding

For the sake of completeness, here is a function to decode your results:

def decode(arr):
    mask = (arr > 0x3F)                          # entries carrying a diacritic in the high bits
    nnz = np.count_nonzero(mask)
    ind = np.flatnonzero(mask) + np.arange(nnz)  # output positions, shifted right for reinserted diacritics

    diacritics = (arr[mask] >> 6) + 41           # 41 == 0x64A - 0x621, undone by the final += 0x0621
    characters = (arr & 0x3F)
    characters[characters >= 27] += 5            # undo the gap between the two letter ranges

    output = np.empty(arr.size + nnz, dtype='U1').view(np.uint32)
    output[ind] = characters[mask]
    output[ind + 1] = diacritics

    output_mask = np.zeros(output.size, dtype=bool)
    output_mask[ind] = output_mask[ind + 1] = True
    output[~output_mask] = characters[~mask]

    output += 0x0621                             # shift all codes back into the Arabic block

    return output.base.view(f'U{output.size}').item()
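
Assuming data was generated as above, a quick round-trip sanity check would look like:

>>> decode(convert_numpy(data)) == data
True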

As a side note, the work I did here inspired this question: Converting numpy arrays of code points to and from strings

answered Sep 22 '22 by Mad Physicist


map does not seem to be the right tool for the job. You do not want to map characters to other characters, but group them together. Instead, you might try reduce (or functools.reduce in Python 3). Here, I use isalpha to test what kind of character it is; you might need something else.

>>> is_diacritic = lambda x: not x.isalpha()
>>> verse = "XXA)L_I!I%M<LLL>MMQ*Q"
>>> reduce(lambda lst, x: lst + [x] if not is_diacritic(x) else lst[:-1] + [lst[-1]+x], verse, [])
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

However, this is barely readable and also creates lots of intermediate lists. Better just use a boring old for loop, even if you explicitly asked for something else:

res = []
for x in verse:
    if not is_diacritic(x):
        res.append(x)
    else:
        res[-1] += x

By iterating over pairs of consecutive characters, e.g. using zip(verse, verse[1:]) (i.e., (1,2), (2,3), ..., not (1,2), (3,4), ...), you could indeed also use a list comprehension, but I'd still vote for the for loop for readability.

>>> from itertools import zip_longest
>>> [x + y if is_diacritic(y) else x
...  for x, y in zip_longest(verse, verse[1:], fillvalue="")
...  if not is_diacritic(x)]
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']

You could even do the same using map and lambda, but you'd also need to filter first, with another lambda, making the whole thing orders of magnitude uglier and harder to read.
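
For illustration only, a sketch of that map-plus-filter variant (same verse and is_diacritic as above), which indeed reads much worse:

>>> list(map(lambda p: p[0] + p[1] if is_diacritic(p[1]) else p[0],
...          filter(lambda p: not is_diacritic(p[0]),
...                 zip_longest(verse, verse[1:], fillvalue=""))))
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']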

answered Sep 23 '22 by tobias_k