Best compression algorithm for XML?

I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).

So let's say I have an XML file with a few tags.

<verylongtagnumberone>
  <verylongtagnumbertwo>
    text
  </verylongtagnumbertwo>
</verylongtagnumberone>

Now let's say I have a bunch of these very long tags, with many attributes, across multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm that assigns individual tags short pseudonyms like vlt1 or vlt2. However, that wouldn't be as 'open' an approach as I'm going for, so I want to use a common algorithm like DEFLATE or LZ. It would also help if the archive were a .zip file.

Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?
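One way to answer the file-size question for a given set of documents is to measure it. The following is a minimal sketch, assuming Python and its standard-library codecs (zlib for DEFLATE, bz2, and lzma); the sample document built by build_sample_xml is hypothetical and only mimics the long-tag structure described above. Results depend heavily on the real input, so no sizes are claimed here.

```python
import bz2
import lzma
import zlib


def build_sample_xml(n_records=200):
    """Build a hypothetical, highly repetitive XML document as UTF-8 bytes."""
    rows = []
    for i in range(n_records):
        rows.append(
            "<verylongtagnumberone>"
            f"<verylongtagnumbertwo>text {i}</verylongtagnumbertwo>"
            "</verylongtagnumberone>"
        )
    return ("<root>" + "".join(rows) + "</root>").encode("utf-8")


def compressed_sizes(data):
    """Return the size in bytes of the input under several lossless codecs."""
    return {
        "raw": len(data),
        "deflate": len(zlib.compress(data, 9)),      # same family as .zip's default method
        "bzip2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),  # xz container, as used by 7z-style tools
    }


sample = build_sample_xml()
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>8}: {size} bytes")
```

Running this against representative documents, rather than the synthetic sample, is what would actually indicate which algorithm suits the format.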

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.
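For the packaging scenario described here, this is a rough sketch of writing several XML parts into a DEFLATE-compressed .zip using Python's standard zipfile module. The part names (content.xml, meta.xml) and the helper names pack_document and unpack_document are assumptions made for illustration, not part of any existing standard.

```python
import io
import zipfile


def pack_document(parts):
    """Pack a mapping of {part name: XML text} into an in-memory .zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(
        buffer, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as archive:
        for name, xml_text in parts.items():
            # writestr encodes str data as UTF-8 before compressing it
            archive.writestr(name, xml_text)
    return buffer.getvalue()


def unpack_document(blob):
    """Read every part back out of the archive as text."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name).decode("utf-8") for name in archive.namelist()}


# Hypothetical document parts using the long tag names from the question.
body = "<verylongtagnumberone><verylongtagnumbertwo>text</verylongtagnumbertwo></verylongtagnumberone>"
parts = {
    "content.xml": "<root>" + body * 100 + "</root>",
    "meta.xml": "<meta><title>example</title></meta>",
}
blob = pack_document(parts)
print(len(blob), "bytes packed from", sum(len(p) for p in parts.values()), "bytes of XML")
```

DEFLATE is chosen in this sketch because it is the method that generic .zip readers are most likely to support; zipfile also exposes ZIP_BZIP2 and ZIP_LZMA, at the cost of compatibility with older tools.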

EDIT: The 'encryption' thing was a typo; it should have been 'compression'.

Aethex asked Jul 04 '09 14:07



1 Answer

There is a W3C standard (not yet released at the time of writing) named EXI (Efficient XML Interchange).

It is intended to become THE data format for compressing XML data in the future (it is claimed to be the last binary format that will be necessary). Because it is optimized for XML, it is claimed to compress XML far more efficiently than conventional general-purpose compression algorithms.

With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).

EXI = (XML + XMLSchema) as binary.

And here is an open-source implementation (I don't know whether it is stable yet):
EXIficient

ivan_ivanovich_ivanoff answered Sep 19 '22 21:09