I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that XML tag content disallows certain characters. The only allowed ones are described in the XML specs:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Which means that the disallowed ones are:
- the 29 Unicode control characters: (0x00-0x08) + (0x0B-0x0C) + (0x0E-0x1F)
- the UTF-16 surrogate code points: 0xD800-0xDFFF (8 high-byte values, 0xD8-0xDF)
- the Unicode noncharacters 0xFFFE-0xFFFF (1 high-byte value, 0xFF)
- <, > and &, because they trigger entity parsing (3 values)
1 byte can encode 256 possible values. With these restrictions the first byte is limited to 256 - 29 - 8 - 1 - 3 = 215 possibilities.
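For reference, here is a tiny Python predicate (a helper of my own, just transcribing the Char production above) that tests whether a code point is a legal XML 1.0 character:

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

assert is_xml10_char(0x41) and not is_xml10_char(0x00)
```

Note that <, > and & are legal Chars; they just have to be escaped (or avoided) in content.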
Of that first byte's 215 possibilities, base64 only uses 64. Base64 generates 33% overhead (6 bits become 1 byte once encoded with base64).
So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)
NB: Please don't answer this post with "You shouldn't use XML to encode binary data because...". Just don't. At best, you could argue against using all 215 possibilities because of poor XML parser support.
NB2: I'm not speaking about the second byte, but there are certainly some considerations we can develop regarding the number of possibilities and the fact that it should start with 10xxxxxx to respect the UTF-8 standard when we use the supplementary Unicode planes (and what if we don't?).
There are two standard methods of encoding binary data in an XML document: hexBinary and base64Binary. The base64Binary encoding makes better use of the available XML characters, and on average a base64-encoded binary field is 2/3 the size of its hexBinary equivalent. base64Binary is widely used by the MIME format and by various XML-based standards.
Base64 encoding schemes are commonly used when binary data needs to be stored and transferred over media designed to deal with textual data. This ensures that the data remains intact, without modification, during transport.
A Base64 image representation can be placed directly within HTML to render an image. Raw binary takes up less space and benefits from greater network effects and standardization; e.g., Amazon Simple Storage Service (S3) stores objects as binary files.
Base64 is commonly used in a number of applications including email via MIME, and storing complex data in XML. One common application of Base64 encoding on the web is to encode binary data so it can be included in a data: URL.
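For illustration, a minimal Python example using the standard library's base64 module (the <blob> element name is made up):

```python
import base64

payload = bytes(range(256))                        # arbitrary binary data
encoded = base64.b64encode(payload).decode("ascii")
xml = f"<blob>{encoded}</blob>"                    # A-Za-z0-9+/= are all XML-safe
print(len(payload), "->", len(encoded))            # 256 -> 344: ~33% overhead
```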
Thank you Aya for the Ascii85 link; there are very good ideas there.
I developed them below for our case.
For 1-byte characters (0xxxxxxx): 96 possibilities per byte
+ UTF-8 ASCII chars (0xxxxxxx): +2^7
- UTF-8 control chars (000xxxxx): -2^5
+ XML-allowed UTF-8 control chars (0x09, 0x0A, 0x0D): +3
- XML entity-disallowed chars (<, >, &): -3
EDIT: This is for the XML 1.0 specs. The XML 1.1 specs allow the use of control chars except 0x00...
For 2-byte characters (110xxxxx 10xxxxxx): 1,920 possibilities per 2 bytes
+ UTF-8 2-byte chars (110xxxxx 10xxxxxx): +2^11
- UTF-8 illegal non-canonical chars (1100000x 10xxxxxx): -2^7
For 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 61,440 possibilities per 3 bytes
+ UTF-8 3-byte chars (1110xxxx 10xxxxxx 10xxxxxx): +2^16
- UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx): -2^11
- Unicode reserved UTF-16 code points (11101101 101xxxxx 10xxxxxx): -2^11
And I won't make the calculation for 4-byte characters; that's pointless: the number of possibilities would be negligible, and there are too many illegal UTF-8 characters in this range.
So let's see what combinations we can do in a 3-byte (24-bit) space:
- three 1-byte chars (0xxxxxxx 0xxxxxxx 0xxxxxxx): 96 × 96 × 96 = 884,736 possibilities
- a 1-byte char then a 2-byte char (0xxxxxxx 110xxxxx 10xxxxxx): 96 × 1,920 = 184,320 possibilities
- a 2-byte char then a 1-byte char (110xxxxx 10xxxxxx 0xxxxxxx): 1,920 × 96 = 184,320 possibilities
- one 3-byte char (1110xxxx 10xxxxxx 10xxxxxx): 61,440 possibilities
There would be other possibilities (like a 3-byte char starting or ending across the space boundary), but, like the 4-byte chars, they would be difficult to evaluate (for me) and are probably negligible.
Total number of possibilities: 884,736 + 184,320 + 184,320 + 61,440 = 1,314,816 ≈ 2^20.33.
How much overhead does that mean? A 24-bit space carries log2(1,314,816) ≈ 20.33 useful bits, so the overhead is 24/20.33 - 1 ≈ 18%. Way better than Base64's 33% overhead and Ascii85's 25% overhead!
EDIT: This is for the XML 1.0 specs. With XML 1.1 (not widely supported...), the theoretical overhead is 12.55%. I managed to make a binary-safe algorithm with 14.7% overhead for XML 1.1.
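These figures are easy to check with a few lines of Python (the variable names are mine; the counts come straight from the calculation above):

```python
import math

one_byte   = 2**7 - 2**5 + 3 - 3               # 96 usable 1-byte chars
two_byte   = 2**11 - 2**7                      # 1920 usable 2-byte chars
three_byte = 2**16 - 2**11 - 2**11             # 61440 usable 3-byte chars

total = one_byte**3 + 2 * one_byte * two_byte + three_byte
useful_bits = math.log2(total)
print(total)                   # 1314816
print(useful_bits)             # ~20.33 useful bits per 24-bit space
print(24 / useful_bits - 1)    # ~0.18 -> 18% overhead
```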
The bad news is that we can't easily get that 18% overhead without using a big "dictionary" (i.e., long encoding sets). But it's easy to get 20%, and quite easy but less practical to get 19%.
Good coding length candidates:
- 6 space bits encoding 5 payload bits: 20% overhead
- 12 space bits encoding 10 payload bits: 20% overhead
- 24 space bits encoding 20 payload bits: 20% overhead
- 25 space bits encoding 21 payload bits: ~19% overhead
NB: 0.84 is the average "usefulness" of a space bit (20.32/24).
We need to build a "dictionary" that maps the "space possibles" (random bit sequences of length 5, 10, 20 or 21 bits, depending on the coding length chosen for the algorithm; just pick one) onto UTF-8-compatible sequences (UTF-8 bit sequences of length 6, 12, 24 or 25 bits accordingly).
The easiest starting point would be encoding 20-bit sequences into 24-bit XML-compatible UTF-8 sequences: that's exactly the example taken above to calculate the possibilities, and, being 3 UTF-8 bytes long, it means we won't have to worry about unterminated UTF-8 chars.
Note that we MUST USE the 2-byte (or longer) UTF-8 character encoding space to reach the 20% overhead. With only 1-byte UTF-8 characters we can reach, at best, 25% overhead with RADIX-24. However, the 3-byte UTF-8 characters are not needed to reach the 20% overhead.
That's the next challenge for this question. Who wants to play? :)
20 binary bits to encode: ABCDEFGHIJKLMNOPQRST
Resulting UTF-8 string named "encoded": 24 bits long
Mathematical encoding algorithm (not based on any known programming language):
If GH != 00 && NO != 00: # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 chars must not start with 000xxxxx, i.e. ASCII control chars)
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST
Else If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST
    Else If GH == 00: # 18 bits to encode, 18 space bits with restrictions (no 1-byte UTF-8 ASCII control char, no non-canonical 2-byte UTF-8 char)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else: # NO == 00, 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL
Else: # ABCD == 0000, 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST
On "encoded" variable apply:
convert < (0x3C) to Line Feed (0x0A)
convert > (0x3E) to Cariage Return (0x0D)
convert & (0x26) to TAB (0x09)
And that's how you get only 20% overhead.
This algorithm doesn't yet provide a way to manage string termination when the length of the string to encode is not a multiple of 20 bits. The decoding algorithm also has to be provided, but that's quite easy (just don't forget to throw exceptions to enforce uniqueness of decoding).
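To make this concrete, here is a rough Python sketch of the encoder above (the structure and names are mine; the decoder and termination handling are omitted, as noted). Bit A is the most significant of the 20:

```python
def encode20(v: int) -> bytes:
    """Encode a 20-bit integer into 3 XML-safe bytes (valid UTF-8)."""
    assert 0 <= v < 1 << 20
    # Named bit fields, A (bit 19) down to T (bit 0): ABCDEFGHIJKLMNOPQRST
    ABCD   = (v >> 16) & 0xF
    ABCDE  = (v >> 15) & 0x1F
    ABCDEF = (v >> 14) & 0x3F
    E, F   = (v >> 15) & 1, (v >> 14) & 1
    GH     = (v >> 12) & 0x3
    EFGH   = (v >> 12) & 0xF
    IJKL   = (v >> 8) & 0xF
    IJKLM  = (v >> 7) & 0x1F
    IJKLMN = (v >> 6) & 0x3F
    M      = (v >> 7) & 1
    NO     = (v >> 5) & 0x3
    OPQRST = v & 0x3F
    PQRST  = v & 0x1F

    if GH != 0 and NO != 0:            # 01ABCDEF 0GHIJKLM 0NOPQRST
        out = [0x40 | ABCDEF, (GH << 5) | IJKLM, (NO << 5) | PQRST]
    elif ABCD != 0:
        if GH == 0 and NO == 0:        # 0010ABCD 01EFIJKL 01MPQRST
            out = [0x20 | ABCD, 0x40 | (E << 5) | (F << 4) | IJKL,
                   0x40 | (M << 5) | PQRST]
        elif GH == 0:                  # 0NOFIJKL 110ABCDE 10MPQRST
            out = [(NO << 5) | (F << 4) | IJKL, 0xC0 | ABCDE,
                   0x80 | (M << 5) | PQRST]
        else:                          # NO == 0: 110ABCDE 10MPQRST 0GHFIJKL
            out = [0xC0 | ABCDE, 0x80 | (M << 5) | PQRST,
                   (GH << 5) | (F << 4) | IJKL]
    else:                              # ABCD == 0: 0011EFGH 01IJKLMN 01OPQRST
        out = [0x30 | EFGH, 0x40 | IJKLMN, 0x40 | OPQRST]

    swap = {0x3C: 0x0A, 0x3E: 0x0D, 0x26: 0x09}   # <, >, & -> LF, CR, TAB
    return bytes(swap.get(b, b) for b in out)

print(encode20(0b10101100001111000011))  # any 20-bit value -> 3 XML-safe bytes
```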
It's worse than that: you don't actually have 215 different byte values you can use. The resulting binary data has to be valid in whatever encoding the XML is represented in (which is almost certainly UTF-8), which means that many, many byte sequences are forbidden. 0xC2 followed by 0x41 would be just one random example. XML is text (a sequence of Unicode characters), not binary data. When transmitted, it is encoded using some encoding (almost always UTF-8). If you try to treat it as binary data, then you are, in my opinion, asking for way more trouble than it's worth dealing with.
If you still want to do this...
XML is text. So let's not try to encode your binary data as binary data; that will not lead to an easy or obvious way to shoehorn it into an XML document. Let's try instead encoding your binary data as text!
Let's try one very simple encoding: group the binary data into 20-bit chunks, and encode each chunk as the character whose code point is 0x10000 plus the value of the chunk.
This means you exclusively use characters from planes 1 through 16. All of the restricted characters are in plane 0 (the BMP), so you are safe here.
When you then encode this XML document as UTF-8 for transmission, each of these characters will require 4 bytes to encode. So you consume 32 bits for every 20 bits of original data, which is 60% overhead compared to pure binary encoding of the original data. This is worse than base64's 33%, which makes it a terrible idea.
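For concreteness, a minimal Python sketch of this scheme (the function name is mine; termination handling is omitted, and the trailing chunk is simply zero-padded):

```python
def encode_planes(data: bytes) -> str:
    """Map each 20-bit group to a code point in planes 1-16 (U+10000 + value)."""
    n = len(data) * 8
    bits = int.from_bytes(data, "big")
    pad = (-n) % 20                      # zero-pad up to a 20-bit boundary
    bits <<= pad
    n += pad
    return "".join(chr(0x10000 + ((bits >> (n - 20 * (i + 1))) & 0xFFFFF))
                   for i in range(n // 20))

data = b"12345678901234567890"           # 20 bytes = 160 bits = 8 groups
print(len(encode_planes(data).encode("utf-8")) / len(data))  # 1.6 -> 60% overhead
```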
This encoding scheme is slightly wasteful because it makes no use of BMP characters. Can we use BMP characters to make it better? Not trivially. 20 is the largest size we can use for the groups (log2(0x10FFFF) ≈ 20.09). We could remap our scheme to use as many BMP characters as possible, because these take less space to encode in UTF-8, but not only would this complicate the encoding a lot (the forbidden characters are scattered, so there are several cases to handle), it could only lead to an improvement for about 6.25% of bit patterns (the fraction of Unicode characters that are in the BMP), and for the majority of that 6.25% we'd save only one byte. For random data, the overhead decreases from 60% to around 55%. The result would still be much worse than base64 except for some very contrived data. Note that the overhead is data-dependent, though. For 0.2% of bit patterns you would actually get compression instead of overhead (60% compression for 0.012% of patterns and 20% compression for 0.18% of patterns). But these fractions are really low. It's just not worth it.
To put this another way: if you want to encode anything using 4-byte UTF-8 sequences, you need to use 32 bits per sequence (of course), but 11 of those bits are fixed and unchangeable: the bits must fit the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (there are only 21 x's in there). That overhead of 60% is built into UTF-8, so if you want to use this as the basis of any encoding that improves upon the overhead of base64, you are starting from behind!
I hope this convinces you that you can't improve on the density of base64 using any scheme of this type.