I'm receiving data from a channel that knows nothing about UTF-8. When a character is encoded as multiple bytes and a chunk boundary falls in the middle of it, converting the received chunk to text raises an error. Because the interface is an endless stream, I cannot tell when the data is complete, so I need to handle partial UTF-8 decoding: decode what I can and store the leftover bytes, which are then prepended to the next chunk. Is there a neat function in Python that does this?
[EDIT] To be clear, I know about this function from docs.python:
bytes.decode(encoding="utf-8", errors="ignore")
but it does not tell me where the error occurred, so I cannot know how many bytes from the end I should keep.
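To make the problem concrete, here is a minimal sketch (the two sample chunks are my own illustration, not real data from the channel). With errors="ignore" both halves of a split character are silently dropped, so there is nothing left to carry over to the next chunk:

```python
# "Ž" is encoded in UTF-8 as the two bytes 0xC5 0xBD.
chunk1 = bytes([0x61, 0xC5])   # "a" plus the first half of "Ž"
chunk2 = bytes([0xBD, 0x6C])   # second half of "Ž" plus "l"

text1 = chunk1.decode("utf-8", errors="ignore")
text2 = chunk2.decode("utf-8", errors="ignore")

# The character that straddles the chunk boundary is lost without any indication.
joined = text1 + text2
print(repr(joined))

# Decoding the two chunks together keeps the character.
whole = (chunk1 + chunk2).decode("utf-8")
print(repr(whole))
```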
The codecs module comes to the rescue. It gives you an incremental decoder that does exactly what you need:
import codecs
dec = codecs.getincrementaldecoder('utf8')()
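You feed it with:
dec.decode(input)
and when the stream is over, you can optionally call
dec.decode(bytes(), True)
to force it to flush any stored state. With the default strict error handling, this final call raises UnicodeDecodeError if an incomplete character is still buffered.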
The test becomes:
>>> import sys
>>> def test(arr):
...     dec = codecs.getincrementaldecoder('utf8')()
...     recvString = ""
...     for i in range(len(arr)):
...         # feed one byte at a time; incomplete characters stay buffered in dec
...         recvString += dec.decode(arr[i:i+1])
...         sys.stdout.write("%02d : %s\n" % (i, recvString))
...     recvString += dec.decode(bytes(), True) # will choke on incomplete input...
...     return recvString == arr.decode('utf8')
...
>>> testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
>>> test(testUtf8)
00 : a
01 : a
02 : aŽ
03 : aŽl
04 : aŽlu
05 : aŽlu
06 : aŽluť
07 : aŽluťo
08 : aŽluťou
09 : aŽluťou
10 : aŽluťouč
11 : aŽluťoučk
12 : aŽluťoučk
13 : aŽluťoučký
14 : aŽluťoučký
15 : aŽluťoučký k
16 : aŽluťoučký k
17 : aŽluťoučký ků
18 : aŽluťoučký ků
19 : aŽluťoučký kůň
True
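For a stream that arrives in larger chunks, the same idea can be sketched as a small generator. This is only an illustration under my own assumptions: decode_chunks and the chunk boundaries are hypothetical, and getstate() is used just to peek at the bytes the decoder is holding back.

```python
import codecs

def decode_chunks(chunks):
    """Yield the text decodable from each chunk; partial characters are carried over."""
    dec = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        yield dec.decode(chunk)
    # final=True raises UnicodeDecodeError if the stream ended mid-character
    yield dec.decode(b"", final=True)

data = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4,
              0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])

# Hypothetical chunk boundaries; the first two cut a character in half.
chunks = [data[:2], data[2:6], data[6:]]
pieces = list(decode_chunks(chunks))
print(pieces)

# Peek at the bytes the decoder is still holding after a partial character.
dec = codecs.getincrementaldecoder("utf-8")()
dec.decode(data[:2])
pending, _flags = dec.getstate()
print(pending)
```

Nothing has to be stored or prepended by hand, because the decoder object keeps the leftover bytes between calls.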
So far I have come up with this not-so-nice function:
def decodeBytesUtf8Safe(toDec):
    """
    Decodes a UTF-8 byte array to a string. It handles the case where the end of the
    byte array is incomplete and would cause a UTF-8 error: the text is decoded only
    up to the error. The rest of the byte array (from the error to the end) is returned
    as the second value and can be prepended to the next byte array and decoded later.
    :param toDec: byte array to be decoded (e.g. bytes("abc", "utf8"))
    :return:
     1. decoded string
     2. rest of the byte array which could not be decoded due to the error
    """
    okLen = len(toDec)
    outStr = ""
    while okLen > 0:
        try:
            outStr = toDec[:okLen].decode("utf-8")
        except UnicodeDecodeError:
            # drop one trailing byte and retry
            okLen -= 1
        else:
            break
    return outStr, toDec[okLen:]
You can test it with this script:
import sys

def test(arr):
    expStr = arr.decode("utf-8")
    errorCnt = 0
    # split the input at every possible position and decode the two parts separately
    for i in range(len(arr) + 1):
        decodedTxt, rest = decodeBytesUtf8Safe(arr[0:i])
        decodedTxt2, rest2 = decodeBytesUtf8Safe(rest + arr[i:])
        recvString = decodedTxt + decodedTxt2
        sys.stdout.write("%02d ; %s (%s - %s )\n" % (i, recvString, decodedTxt, decodedTxt2))
        if expStr != recvString:
            print("Error when divided at %i" % (i))
            errorCnt += 1
    return errorCnt

testUtf8 = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4, 0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])
err = test(testUtf8)
print("total errors %i" % (err))
It should give this output:
00 ; aŽluťoučký kůň ( - aŽluťoučký kůň )
01 ; aŽluťoučký kůň (a - Žluťoučký kůň )
02 ; aŽluťoučký kůň (a - Žluťoučký kůň )
03 ; aŽluťoučký kůň (aŽ - luťoučký kůň )
04 ; aŽluťoučký kůň (aŽl - uťoučký kůň )
05 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
06 ; aŽluťoučký kůň (aŽlu - ťoučký kůň )
07 ; aŽluťoučký kůň (aŽluť - oučký kůň )
08 ; aŽluťoučký kůň (aŽluťo - učký kůň )
09 ; aŽluťoučký kůň (aŽluťou - čký kůň )
10 ; aŽluťoučký kůň (aŽluťou - čký kůň )
11 ; aŽluťoučký kůň (aŽluťouč - ký kůň )
12 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
13 ; aŽluťoučký kůň (aŽluťoučk - ý kůň )
14 ; aŽluťoučký kůň (aŽluťoučký - kůň )
15 ; aŽluťoučký kůň (aŽluťoučký - kůň )
16 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
17 ; aŽluťoučký kůň (aŽluťoučký k - ůň )
18 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
19 ; aŽluťoučký kůň (aŽluťoučký ků - ň )
20 ; aŽluťoučký kůň (aŽluťoučký kůň - )
total errors 0
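As a possible variation on the function above (a sketch, not a tested replacement), the retry loop can be avoided by reading the start and end attributes of UnicodeDecodeError, which give the byte range the decoder rejected. The name decode_utf8_prefix is hypothetical. The sketch assumes that an error reaching the very end of the input may be an incomplete character and keeps those bytes for later; an error anywhere else is re-raised as genuinely invalid data.

```python
def decode_utf8_prefix(data):
    """Return (decoded_text, leftover_bytes) for a possibly truncated UTF-8 chunk."""
    try:
        return data.decode("utf-8"), b""
    except UnicodeDecodeError as ex:
        # ex.start / ex.end mark the rejected byte range within data
        if ex.end != len(data):
            raise  # the error is not at the tail, so it is not just truncation
        # everything before ex.start is valid; keep the tail for the next chunk
        return data[:ex.start].decode("utf-8"), data[ex.start:]

data = bytes([0x61, 0xc5, 0xbd, 0x6c, 0x75, 0xc5, 0xa5, 0x6f, 0x75, 0xc4,
              0x8d, 0x6b, 0xc3, 0xbd, 0x20, 0x6b, 0xc5, 0xaf, 0xc5, 0x88])

# Split at every position and check that both halves rejoin to the full text.
for i in range(len(data) + 1):
    head, rest = decode_utf8_prefix(data[:i])
    tail, rest2 = decode_utf8_prefix(rest + data[i:])
    assert head + tail == data.decode("utf-8")
    assert rest2 == b""
```

One limitation of this sketch: a single invalid byte at the very end of a chunk is also kept as leftover, and the error only surfaces on a later call once more data follows it.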