Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fixed length data field and variable length utf-8 encoding

I have a Python project where I have a fixed byte-length text field (NOT FIXED CHAR-LENGTH FIELD) in a comm protocol that contains a utf-8 encoded, NULL padded, NULL terminated string.

I need to ensure that a string fits into the fixed byte-length field. Since utf-8 is a variable width encoding, this makes using brute force to truncate the string at a fixed byte length dicey since you could possibly leave part of a multi-byte character dangling at the end.

Is there a module/method/function/etc that can help me with truncating utf-8 variable width encoded strings to a fixed byte-length?

Something that does Null padding and termination would be a bonus.

This seems like a nut that would have already been cracked. I don't want to reinvent something if it already exists.

like image 697
Dave L. Avatar asked Apr 12 '26 10:04

Dave L.


2 Answers

Let Python detect and eliminate any partial or invalid characters.

byte_str = uni_str.encode('utf-8')
byte_str = byte_str[:size].decode('utf-8', 'ignore').encode('utf-8')

This works because the UTF-8 spec encodes the number of following bytes in the first byte of a character, so the missing bytes can be easily detected.

Edit: Here's the results from this code using a random oriental character string I pulled from another question. The first number is the maximum size, the second is the actual number of bytes in the UTF-8 string.

45 45 具有靜電產生裝置之影像輸入裝置
44 42 具有靜電產生裝置之影像輸入裝
43 42 具有靜電產生裝置之影像輸入裝
42 42 具有靜電產生裝置之影像輸入裝
41 39 具有靜電產生裝置之影像輸入
40 39 具有靜電產生裝置之影像輸入
39 39 具有靜電產生裝置之影像輸入
38 36 具有靜電產生裝置之影像輸
37 36 具有靜電產生裝置之影像輸
36 36 具有靜電產生裝置之影像輸
35 33 具有靜電產生裝置之影像
34 33 具有靜電產生裝置之影像
33 33 具有靜電產生裝置之影像
32 30 具有靜電產生裝置之影
31 30 具有靜電產生裝置之影
like image 168
Mark Ransom Avatar answered Apr 14 '26 22:04

Mark Ransom


It is very easy to see in a UTF-8 stream whether a given byte is at the start (or not) of a given character's byte stream. If the byte is of the form 10xxxxxx then it is a non-initial byte of a character, if the byte is of the form 0xxxxxx it is a single byte character, and other bytes are the initial bytes of a multi-byte character.

As such, you can build your own function without too much difficulty. Just ensure that the last character you add to your field is either of the form 0xxxxxx, or is of the form 10xxxxxx where the next character (which you're not adding) is not of the form 10xxxxxx. I.e. you make sure you've just added a one-byte UTF-8 character or the last byte of a multi-byte UTF-8 character. You can then just add 0s to fill in the rest of your field.

like image 22
borrible Avatar answered Apr 15 '26 00:04

borrible



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!