
Encoding binary data within XML: Are there better alternatives than base64?

I want to encode and decode binary data within an XML file (with Python, but whatever). I have to deal with the fact that XML tag content disallows certain characters. The only allowed ones are described in the XML specs:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the disallowed ones are:

  • 29 Unicode control characters are illegal (0x00 - 0x1F), i.e. (000xxxxx), except 0x09, 0x0A and 0x0D
  • The UTF-16 surrogate code points are illegal (U+D800 - U+DFFF), i.e. (11011xxx)
  • The special Unicode noncharacters are illegal (U+FFFE - U+FFFF), i.e. (11111111 1111111x)
  • <, >, & according to this post, for entity content
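
For reference, the restrictions above can be expressed as a short Python check; this is a minimal sketch, limited to the Basic Multilingual Plane for brevity (supplementary planes U+10000+ are also legal but omitted), and the function name is my own:

```python
import re

# Characters forbidden in XML 1.0 text content (BMP only):
# control chars except TAB/LF/CR, surrogates, and the two noncharacters.
_XML_ILLEGAL = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)

def is_xml_safe(text: str) -> bool:
    """True if text can go into XML 1.0 element content as-is."""
    # <, > and & are legal characters but must be escaped as entities,
    # so treat them as unsafe too, as the question does.
    if any(c in text for c in '<>&'):
        return False
    return _XML_ILLEGAL.search(text) is None
```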

1 byte can encode 256 possible values. With these restrictions, the first byte is limited to 256-29-8-1-3 = 215 possibilities.

Of that first byte's 215 possibilities, base64 only uses 64. Base64 generates 33% overhead (every 6 bits of input become 1 byte once encoded with base64).
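The 33% figure is easy to verify with the standard base64 module:

```python
import base64

payload = bytes(range(256)) * 12        # 3072 bytes of arbitrary binary data
encoded = base64.b64encode(payload)     # 4 output bytes per 3 input bytes

overhead = len(encoded) / len(payload) - 1
print(f"{overhead:.1%}")                # 33.3%
```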

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)

NB: Please don't answer this post with "You shouldn't use XML to encode binary data because...". Just don't. At best, you could argue against using all 215 possibilities because of poor XML parser support.

NB2: I'm not speaking about the second byte, but there are certainly some considerations we can develop regarding the number of possibilities and the fact that it should start with 10xxxxxx to respect the UTF-8 standard when we use the supplementary Unicode planes (and what if not?).

asked Jun 25 '13 by KrisWebDev



2 Answers

Thank you Aya for the Ascii85 link; there are very good ideas there.

I developed them below for our case.


UTF-8 character possibilities:


For 1-byte characters (0xxxxxxx): 96 possibilities per byte

  • + UTF-8 ASCII chars 0xxxxxxx = +2^7
  • - UTF-8 control chars 000xxxxx = -2^5
  • + XML-allowed UTF-8 control chars (0x09, 0x0A, 0x0D) = +3
  • - XML entity-disallowed chars (<, >, &) = -3

EDIT: This is for the XML 1.0 specs. The XML 1.1 specs allow the use of control chars except 0x00...

For 2-byte characters (110xxxxx 10xxxxxx): 1920 possibilities per 2 bytes

  • + UTF-8 2-byte chars 110xxxxx 10xxxxxx = +2^11
  • - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7

For 3-byte characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilities per 3 bytes

  • + UTF-8 3-byte chars 1110xxxx 10xxxxxx 10xxxxxx = +2^16
  • - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  • - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11

And I won't do the calculation for 4-byte characters: that's pointless, since the number of possibilities gained would be negligible and there are too many illegal UTF-8 characters in that range.


The coding possibilities in, let's say, a 3-byte space


So let's see what combinations we can use in a 3-byte (24-bit) space:

  • 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
  • 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
  • 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
  • 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 possibilities

There would be other possibilities (like a 3-byte char ending or starting in the space), but like 4-byte chars, they would be difficult to evaluate (for me) and are probably negligible.

Total number of possibilities:

  • A 3-byte space has 2^24 = 16777216 possibilities.
  • UTF-8-COMPATIBLE possibilities in that space: 884736+2*184320+61440 = 1314816 possibilities.

How much overhead does that mean?

  • 24-bit space usable bits: Log2(16777216) = 24 (of course! that's just for the mathematical understanding)
  • This space's UTF-8 useful bits: Log2(1314816) = 20.32 useful bits.
  • That means we need 24 bits of space to encode 20.32 bits of useful information, i.e. the minimum theoretical overhead is 18%. Way better than Base64's 33% overhead and Ascii85's 25% overhead!
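The counting above is small enough to re-derive mechanically; a Python sketch of the same arithmetic:

```python
import math

one_byte   = 2**7 - 2**5 + 3 - 3        # 96 legal 1-byte chars
two_byte   = 2**11 - 2**7               # 1920 legal 2-byte sequences
three_byte = 2**16 - 2**11 - 2**11      # 61440 legal 3-byte sequences

# the four ways to fill a 3-byte (24-bit) space, as listed above
total = one_byte**3 + 2 * one_byte * two_byte + three_byte
useful_bits = math.log2(total)

print(total)                            # 1314816
print(round(useful_bits, 2))            # 20.33
print(f"{24 / useful_bits - 1:.0%}")    # minimum overhead: 18%
```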

EDIT: This is for the XML 1.0 specs. With XML 1.1 (not widely supported...), the theoretical overhead is 12.55%. I managed to make a binary-safe algorithm with 14.7% overhead for XML 1.1.


How to get near this 18% overhead?


The bad news is that we can't easily get to that 18% overhead without using a big "dictionary" (i.e. long encoding sets). But it's easy to get 20%, and quite easy, though less practical, to get 19%.

Good coding length candidates:

  • 6 bits can encode 5 bits with 20% overhead (2^(6*0.84) > 2^5)
  • 12 bits can encode 10 bits with 20% overhead (2^(12*0.84) > 2^10)
  • 24 bits can encode 20 bits with 20% overhead (2^(24*0.84) > 2^20)
  • 25 bits can encode 21 bits with 19% overhead (2^(25*0.84) > 2^21)

NB: 0.84 is the average "usefulness" of a space bit (20.32/24).
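Those candidates can be checked numerically (using the exact ratio instead of the rounded 0.84):

```python
import math

USEFUL_PER_SPACE_BIT = math.log2(1314816) / 24    # ≈ 0.847

for space_bits, payload_bits in [(6, 5), (12, 10), (24, 20), (25, 21)]:
    capacity = space_bits * USEFUL_PER_SPACE_BIT  # useful bits available
    overhead = space_bits / payload_bits - 1
    assert capacity > payload_bits                # the candidate fits
    print(f"{space_bits:2d} bits encode {payload_bits} bits "
          f"({overhead:.0%} overhead)")
```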


How to build our encoding algorithm?


We need to build a "dictionary" that will map the "space possibles" (random sequences of bits whose length is 5, 10, 20 or 21 bits, depending on the chosen coding length for the algorithm - just choose one) onto UTF-8-compatible sequences (UTF-8 bit sequences whose length is 6, 12, 24 or 25 bits, accordingly).

The easiest starting point would be the encoding of 20-bit sequences into 24-bit UTF-8-compatible sequences: that's exactly the example taken above to calculate the possibilities, and it's 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF-8 chars).

Note that we MUST USE the 2-byte (or above) UTF-8 character encoding space to reach the 20% overhead. With only 1-byte-long UTF-8 characters, we can only reach 25% overhead with RADIX-24. However, the 3-byte-long UTF-8 characters are not needed to reach the 20% overhead.

That's the next challenge for this question. Who wants to play? :)


A proposal for an algorithm, which I'll name BaseUTF-8 for XML


20 binary bits to encode: ABCDEFGHIJKLMNOPQRST

The resulting UTF-8 string, named "encoded": 24 bits long

Mathematical encoding algorithm (not based on any known programming language):

If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 char not starting by 000xxxxx ie ASCII control chars)

If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST    
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (1-byte UTF-8 ASCII control char, 2-byte UTF-8 non-canonical char)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL

If ABCD == 0000: # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST

On "encoded" variable apply:
    convert < (0x3C) to Line Feed (0x0A)
    convert > (0x3E) to Carriage Return (0x0D)
    convert & (0x26) to TAB (0x09)

And that's how you get only 20% overhead.

This algorithm doesn't yet provide a way to manage string termination, for when the length of the string to encode is not a multiple of 20. The decoding algorithm also has to be provided, but that's quite easy (just don't forget to throw exceptions to enforce the uniqueness of the decoding).

answered Nov 11 '22 by KrisWebDev


It's worse than that: you don't actually have 215 different byte values you can use. The resulting binary data has to be valid in whatever encoding the XML is represented in (which is almost certainly UTF-8), which means that many, many byte sequences are forbidden. 0xC2 followed by 0x41 would be just one random example. XML is text (a sequence of Unicode characters), not binary data. When transmitted, it is encoded using some encoding (almost always UTF-8). If you try to treat it as binary data, then you are, in my opinion, asking for way more trouble than it's worth dealing with.

If you still want to do this...

XML is text. So let's not try to encode your binary data as binary data. That will not lead to an easy or obvious way to shoehorn it into an XML document. Let's instead try encoding your binary data as text!

Let's try one very simple encoding:

  • Group your binary data into blocks of 20 bits
  • Encode each group of 20 bits as the Unicode character U+10000 plus the numeric value of the 20 bits.

This will mean you exclusively use characters from planes 1 through 16. All of the restricted characters are in plane 0 (the BMP), so you are safe here.

When you then encode this XML document as UTF-8 for transmission, each of these characters will require 4 bytes to encode. So you consume 32 bits for every 20 bits of original data, which is 60% overhead compared to pure binary encoding of the original data. This is worse than base64's 33%, which makes it a terrible idea.
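For what it's worth, the scheme is only a few lines of Python (the function names are mine):

```python
def encode20(group: int) -> str:
    # one 20-bit group -> one character in Unicode planes 1..16
    assert 0 <= group < 2**20
    return chr(0x10000 + group)

def decode20(ch: str) -> int:
    return ord(ch) - 0x10000

group = 0xABCDE                           # an arbitrary 20-bit group
ch = encode20(group)
assert decode20(ch) == group
assert len(ch.encode('utf-8')) == 4       # 32 bits spent per 20 payload bits
```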

This encoding scheme is slightly wasteful because it makes no use of BMP characters. Can we use BMP characters to make it better? Not trivially. 20 is the largest size we can use for the groups (log(0x10FFFF) ~ 20.09). We could remap our scheme to use as many BMP characters as possible, because these take less space to encode with UTF-8, but not only would this complicate the encoding a lot (the forbidden characters are scattered, so we have several cases to handle), it can only lead to improvement for about 6.25% of bit patterns (the fraction of Unicode characters that are in the BMP), and for the majority of that 6.25%, we'd save only one byte. For random data, the overhead decreases from 60% to around 55%. The result would still be much worse than base64 except for some very contrived data. Note that the overhead is data-dependent though. For 0.2% of bit patterns, you will actually get compression instead of overhead (60% compression for 0.012% of patterns and 20% compression for 0.18% of patterns). But these fractions are really low. It's just not worth it.

To put this another way: if you want to encode anything using 4-byte UTF-8 sequences, you need to use 32 bits per sequence (of course), but 11 of those bits are fixed and unchangeable: the bits must fit the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, and there are only 21 xs in there. That 60% overhead is built into UTF-8, so if you want to use it as the basis of any encoding that improves upon the overhead of base64, you are starting from behind!

I hope this convinces you that you can't improve on the density of base64 using any scheme of this type.

answered Nov 11 '22 by Celada