I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).
So lets say I have an XML file with a few tags.
<verylongtagnumberone> <verylongtagnumbertwo> text </verylongtagnumbertwo> </verylongtagnumberone>
Now lets say I have a bunch of these very long tags with many attributes in my multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm which assigns individual tags pseudonyms like vlt1 or vlt2. However, this wouldn't be as 'open' of a way as I m trying to go for, and I want to use a common algorithm like DEFLATE or LZ. It also helpes if the archive was a .zip file.
Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?
By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.
EDIT: The 'encryption' thing was a typo; it should ave ben 'compression'.
Using Custom CompressionYou can implement custom compression routine for use with you BDB XML whole document containers. When you do this, you must register the compression routine when you create and open your container, and you must always use the same compression for all subsequent uses of the container.
The winner by pure compression is 7z, which isn't surprising to us. We've seen 7z come on the top of file compression benchmarks time and time again. If you want to compress something to use as little space as possible, you should definitely use 7z.
Research into XML compression was initiated with Liefke and Suciu's development in 2000 of a compressor called XMill [6]. It is based on three principles: separating the structure from the content of the XML document, bucketing the content based on their tags, and compressing the individual buckets separately.
There is a W3 (not-yet-released) standard named EXI (Efficient XML Interchange).
Should become THE data format for compressing XML data in the future (claimed to be the last necessary binary format). Being optimized for XML, it compresses XML more ways more efficient than any conventional compression algorithm.
With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).
EXI = (XML + XMLSchema) as binary.
And here you go with the opensource implementation (don't know if it's already stable):
Exificient
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With