Are null bytes allowed in unicode strings?
I am not asking about UTF-8; I mean the high-level object representation of a Unicode string.
Background
We store Unicode strings containing null bytes in PostgreSQL via Python.
The strings are cut off at the null byte when we read them back.
The code 0x0000 is the Unicode string terminator for a null-terminated string of 16-bit characters. A single null byte is not sufficient for this code, because many Unicode characters contain a null byte as either the high or the low byte. An example is the letter A, for which the character code is 0x0041.
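The point about 16-bit codes can be checked from Python with the standard UTF-16 codec. This is only a small sketch; the sample text and variable names are made up for illustration.

```python
# Hypothetical sample text: two letters followed by the terminator U+0000.
text = "AB\x00"
encoded = text.encode("utf-16-le")  # little-endian, no byte-order mark
print(encoded)  # b'A\x00B\x00\x00\x00'

# A naive scan for a single zero byte stops inside the letter A.
first_zero_byte = encoded.index(0)

# A correct scan looks for a whole 16-bit code unit equal to 0x0000.
units = [int.from_bytes(encoded[i:i + 2], "little")
         for i in range(0, len(encoded), 2)]
first_zero_unit = units.index(0)

print(first_zero_byte, first_zero_unit)  # 1 2
```

The byte scan reports offset 1 (the high byte of A), while the code-unit scan reports unit 2, which is the real terminator.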
No, a NUL byte cannot appear at an arbitrary place in a UTF-8 string: the continuation bytes of a multi-byte sequence are never NUL, so a 0x00 byte only ever encodes the character U+0000 itself.
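A quick way to see this for UTF-8 is to encode a few characters and look for a zero byte. The helper name and the sample characters below are illustrative, not taken from any library.

```python
def utf8_has_zero_byte(ch: str) -> bool:
    """Return True if the UTF-8 encoding of ch contains a 0x00 byte."""
    return 0 in ch.encode("utf-8")

# Hypothetical samples covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
sample_results = [utf8_has_zero_byte(c) for c in samples]
print(sample_results)  # [False, False, False, False]

# Only the character U+0000 itself produces a zero byte.
nul_result = utf8_has_zero_byte("\x00")
print(nul_result)  # True

# Python encodes an embedded U+0000 without complaint.
embedded = "foo\x00bar".encode("utf-8")
print(embedded)  # b'foo\x00bar'
```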
Null: U+0000 - Unicode Character Table.
The str.replace() method removes occurrences of the \x00 character by replacing them with an empty string. The \x00 character is the null character, a byte with all bits set to 0.
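A minimal sketch of that approach, assuming it is acceptable to drop the null characters entirely before storing the value (the sample string is made up):

```python
# Hypothetical input with embedded and trailing null characters.
raw = "foo\x00bar\x00"

# str.replace returns a new string; the original is left unchanged.
cleaned = raw.replace("\x00", "")

print(repr(raw))      # 'foo\x00bar\x00'
print(repr(cleaned))  # 'foobar'
```

Note that this loses information: the cleaned string can no longer be told apart from one that never contained a null character.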
On the database side, PostgreSQL itself does not allow a null byte ('\0') in a string stored in char/text/varchar fields, so if you try to store a string containing one you receive an error. Example:
postgres=# SELECT convert_from('foo\000bar'::bytea, 'unicode');
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x00
If you really need to store such information, you can use the bytea data type on the PostgreSQL side. Make sure to encode it correctly.
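On the Python side, "encoding it correctly" could look roughly like the sketch below. It only shows the conversion between str and bytes; how the bytes are passed to a bytea column depends on the database driver and is not shown here. The variable names and the choice of UTF-8 are assumptions.

```python
# Hypothetical value containing a null character.
text = "foo\x00bar"

# Encode explicitly before writing to a binary (bytea) column.
payload = text.encode("utf-8")
print(payload)  # b'foo\x00bar'

# After reading the bytes back, decode with the same encoding.
restored = payload.decode("utf-8")
print(restored == text)  # True
```

Using the same encoding in both directions is what keeps the round trip lossless, including the null character.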
Python itself is perfectly capable of holding null characters (value zero) in both byte strings and Unicode strings. However, if you call out to a library implemented in C, that library may use the C convention of stopping at the first null character.
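Both halves of that statement can be observed with the standard ctypes module, which exposes a C-style view of a byte buffer. This is only an illustration of the C convention, not a reproduction of what any particular database driver does.

```python
import ctypes

# Python keeps the null character and counts it like any other.
s = "foo\x00bar"
b = s.encode("utf-8")
print(len(s), len(b))  # 7 7

# Copy the bytes into a C char buffer (ctypes appends a trailing NUL).
buf = ctypes.create_string_buffer(b)

raw_view = buf.raw    # the full buffer contents
c_view = buf.value    # read as a null-terminated C string

print(raw_view)  # b'foo\x00bar\x00'
print(c_view)    # b'foo'
```

The data is still in the buffer, but anything that reads it as a null-terminated string sees only the part before the first null byte.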