
python decode partial utf-8 byte array

Tags: python, utf-8

I'm receiving data from a channel that is not aware of UTF-8 rules. When UTF-8 uses multiple bytes to encode one character and I try to convert only part of the received data to text, I sometimes get an error during conversion. Because of the nature of the interface (a stream without any end), I cannot tell when the data is complete. So I need to handle partial UTF-8 decoding: decode what I can and store the incomplete remainder, which is then prepended to the next chunk of data. Is there a neat function in Python that does this?

[EDIT] To be clear, I know about this function from docs.python:

 bytes.decode(encoding="utf-8", errors="ignore")

but the problem is that it does not tell me where the error is, so I cannot know how many bytes from the end I should keep.
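
To illustrate the problem, here is a small sketch of what errors="ignore" does when a two-byte character is cut between chunks. The two-chunk split is a made-up example, not data from the real channel. Both halves of the character are dropped silently, and nothing reports how many bytes should have been kept:

```python
# "Ž" is encoded in UTF-8 as the two bytes 0xC5 0xBD.
# Hypothetical split: the channel delivers them in separate chunks.
chunk1 = b"a\xc5"
chunk2 = b"\xbdl"

text1 = chunk1.decode("utf-8", errors="ignore")  # lone lead byte is dropped
text2 = chunk2.decode("utf-8", errors="ignore")  # lone continuation byte is dropped

combined = text1 + text2
expected = (chunk1 + chunk2).decode("utf-8")

print(repr(combined))  # the character is missing here
print(repr(expected))  # the character is present here
```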

Vit Bernatik asked Dec 24 '22 00:12


2 Answers

The codecs module comes to the rescue: it provides an incremental decoder that does exactly what you need:

import codecs

dec = codecs.getincrementaldecoder('utf8')()

You can feed it with dec.decode(input). When the stream is over, you can optionally call dec.decode(bytes(), True) to force it to flush any stored state; that final call raises UnicodeDecodeError if an incomplete sequence is still buffered.

The test becomes:

>>> import sys
>>> def test(arr):
...     dec = codecs.getincrementaldecoder('utf8')()
...     recvString = ""
...     for i in range(len(arr)):
...         # feed one byte at a time; incomplete sequences stay buffered in dec
...         recvString += dec.decode(arr[i:i+1])
...         sys.stdout.write("%02d : %s\n" % (i, recvString))
...     recvString += dec.decode(bytes(), True)  # final=True: will choke on incomplete input
...     return recvString == arr.decode('utf8')
...

>>> testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
>>> test(testUtf8)
00 : a
01 : a
02 : aŽ
03 : aŽl
04 : aŽlu
05 : aŽlu
06 : aŽluť
07 : aŽluťo
08 : aŽluťou
09 : aŽluťou
10 : aŽluťouč
11 : aŽluťoučk
12 : aŽluťoučk
13 : aŽluťoučký
14 : aŽluťoučký 
15 : aŽluťoučký k
16 : aŽluťoučký k
17 : aŽluťoučký ků
18 : aŽluťoučký ků
19 : aŽluťoučký kůň
True
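
For the endless-stream case from the question, the same decoder can be wrapped in a generator. The following is only a sketch: the name decode_stream and the sample chunks are made up for illustration. It also shows getstate(), whose first element holds the bytes the decoder is still waiting to complete:

```python
import codecs

def decode_stream(chunks):
    """Yield the text decoded from each chunk of a UTF-8 byte stream."""
    dec = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # Bytes of an unfinished character are kept inside the decoder
        # and completed by the next chunk.
        yield dec.decode(chunk)
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    yield dec.decode(b"", final=True)

# Hypothetical chunks that split two-byte characters across boundaries.
chunks = [b"a\xc5", b"\xbdlu\xc5", b"\xa5"]
pieces = list(decode_stream(chunks))
text = "".join(pieces)
print(pieces)

# Inspecting the pending bytes after a partial chunk.
probe = codecs.getincrementaldecoder("utf-8")()
first = probe.decode(b"a\xc5")
pending = probe.getstate()[0]
print(first, pending)
```

If the stream never ends, the final call is simply never reached, and the decoder keeps at most one incomplete character buffered between chunks.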
Serge Ballesta answered Dec 26 '22 14:12


So far I have come up with this not-so-nice function:

def decodeBytesUtf8Safe(toDec):
    """
    Decodes a UTF-8 byte array to a string. It can handle the case where the end of the
    byte array is incomplete and causes a UTF-8 error. In that case the text is decoded
    only up to the error. The rest of the byte array (from the error to the end) is
    returned as the second value and can be combined with the next byte array and
    decoded next time.
    :param toDec: byte array to be decoded (e.g. bytes("abc", "utf8"))
    :return:
     1. decoded string
     2. rest of the byte array which could not be decoded due to the error
    """
    okLen = len(toDec)
    outStr = ""
    while okLen > 0:
        try:
            outStr = toDec[:okLen].decode("utf-8")
        except UnicodeDecodeError:
            okLen -= 1  # drop one trailing byte and retry
        else:
            break
    return outStr, toDec[okLen:]

You can test it using this script:

import sys

def test(arr):
    expStr = arr.decode("utf-8")
    errorCnt = 0
    # split the array at every possible position and decode both halves
    for i in range(len(arr) + 1):
        decodedTxt, rest = decodeBytesUtf8Safe(arr[0:i])
        decodedTxt2, rest2 = decodeBytesUtf8Safe(rest + arr[i:])
        recvString = decodedTxt + decodedTxt2
        sys.stdout.write("%02d ; %s (%s - %s )\n" % (i, recvString, decodedTxt, decodedTxt2))
        if expStr != recvString:
            print("Error when divided at %i" % (i))
            errorCnt += 1
    return errorCnt

testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
err = test(testUtf8)
print("total errors %i" % (err))

It should give you this output:

00 ; aŽluťoučký kůň ( - aŽluťoučký kůň )
01 ; aŽluťoučký kůň (a - Žluťoučký kůň )
02 ; aŽluťoučký kůň (a - Žluťoučký kůň )
03 ; aŽluťoučký kůň (aŽ - luťoučký kůň )
04 ; aŽluťoučký kůň (aŽl - uťoučký kůň )
05 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
06 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
07 ; aŽluťoučký kůň (aŽluť - oučký kůň )
08 ; aŽluťoučký kůň (aŽluťo - učký kůň )
09 ; aŽluťoučký kůň (aŽluťou - čký kůň )
10 ; aŽluťoučký kůň (aŽluťou - čký kůň )
11 ; aŽluťoučký kůň (aŽluťouč - ký kůň )
12 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
13 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
14 ; aŽluťoučký kůň (aŽluťoučký -  kůň )
15 ; aŽluťoučký kůň (aŽluťoučký  - kůň )
16 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
17 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
18 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
19 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
20 ; aŽluťoučký kůň (aŽluťoučký kůň -  )
total errors 0
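
The function above retries the decode once for every trailing byte it drops. A possible single-pass variant, offered here as a sketch and not as part of the original answer, relies on the start attribute of UnicodeDecodeError, which gives the index where the first undecodable sequence begins. The name decode_prefix is made up for this illustration:

```python
def decode_prefix(data):
    """Return (decoded_text, undecoded_rest) for a UTF-8 byte string."""
    try:
        return data.decode("utf-8"), b""
    except UnicodeDecodeError as ex:
        # ex.start is the index of the first byte that could not be decoded;
        # everything before it is valid UTF-8.
        return data[:ex.start].decode("utf-8"), data[ex.start:]

# Same test bytes as in the script above.
data = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4,
              0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
mismatches = 0
for i in range(len(data) + 1):
    head, rest = decode_prefix(data[:i])
    tail, leftover = decode_prefix(rest + data[i:])
    if head + tail != data.decode("utf-8") or leftover != b"":
        mismatches += 1
print("total errors %i" % mismatches)
```

Like the loop-based version, this holds back everything from the first error onward, so a genuinely invalid byte in the middle of the stream (as opposed to a truncated tail) would stall decoding until it is handled separately.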
Vit Bernatik answered Dec 26 '22 13:12