
Compression algorithms specifically optimized for HTML content?

Are there any compression algorithms -- lossy or lossless -- that have been specifically adapted to deal with real-world (messy and invalid) HTML content?

If not, what characteristics of HTML could we take advantage of to create such an algorithm? What are the potential performance gains?

Also, I'm not asking the question to serve such content (via Apache or any other server), though that's certainly interesting, but to store and analyze it.

Update: I don't mean GZIP -- that's obvious -- but rather an algorithm specifically designed to take advantage of characteristics of HTML content. For example, the predictable tag and tree structure.

asked Mar 10 '10 by hmason


1 Answer

I do not know of an "off-the-shelf" compression library explicitly optimized for HTML content.

Yet, HTML text should compress quite nicely with generic algorithms (do read the bottom of this answer for better algorithms). Typically, all variations on Lempel–Ziv perform well on HTML-like languages, owing to the highly repetitive nature of specific language idioms; GZip, often cited, uses such an LZ-based algorithm (LZ77, I think).
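As a rough illustration (just a sketch; the file name is a placeholder, not a particular benchmark), running any real-world HTML document through zlib's DEFLATE, the same algorithm GZip builds on, already gives a substantial reduction:

```python
# Quick sanity check: generic DEFLATE (the algorithm behind gzip) already
# compresses markup well, because tag names, attributes and boilerplate repeat.
import zlib

html = open("page.html", "rb").read()   # placeholder: any real-world HTML file
packed = zlib.compress(html, 9)         # level 9 = maximum compression

print(f"original:   {len(html):>8} bytes")
print(f"compressed: {len(packed):>8} bytes "
      f"({100 * len(packed) / len(html):.1f}% of original)")
```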

One idea for improving upon these generic algorithms would be to prime an LZ-type circular buffer with the most common HTML tags and patterns at large. In this fashion, we'd reduce the compressed size by using back-references from the very first instance of such a pattern. The gain would be particularly noticeable on smaller HTML documents.
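zlib happens to expose exactly this knob as a "preset dictionary". Below is a minimal sketch; the HTML_DICT contents are my own guesses at frequent fragments rather than a tuned dictionary (a real one would be derived from a corpus of pages, and zlib caps it at 32 KiB):

```python
import zlib

# Illustrative guesses at common HTML fragments; most-frequent material
# should sit near the END of a zlib preset dictionary.
HTML_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title></head>'
    b'<body><div class=""><span></span></div><a href="http://"></a>'
    b'<script type="text/javascript"></script><link rel="stylesheet" href="">'
    b'<img src="" alt=""><p></p><br/><ul><li></li></ul></body></html>'
)

def compress_html(html: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=HTML_DICT)
    return c.compress(html) + c.flush()

def decompress_html(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=HTML_DICT)
    return d.decompress(blob) + d.flush()

page = (b'<!DOCTYPE html><html><head><title>Hi</title></head>'
        b'<body><p>Hello</p></body></html>')
plain  = zlib.compress(page, 9)
primed = compress_html(page)
print(len(page), len(plain), len(primed))  # the primed version tends to win on short pages
assert decompress_html(primed) == page
```

The decompressor must of course be built with the exact same dictionary, which is the whole point: the shared context is agreed upon out of band instead of being rediscovered in every small document.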

A complementary, similar idea is to have the compression and decompression methods imply (i.e. not transmit) the statistical model used by the entropy-coding stage of an LZ-x algorithm (say, the Huffman tree in the case of LZH), with statistics specific to typical HTML, being careful to exclude from the character counts the [statistically weighted] instances of characters encoded by back-references. Such a filtered character distribution would probably become closer to that of plain English (or the targeted web sites' national language) than the complete HTML text.
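To make that "implied model" idea concrete, here is a toy sketch of my own (the frequencies are made-up stand-ins for statistics gathered from real HTML): both ends deterministically rebuild the identical Huffman code from an agreed-upon table, so no tree ever travels with the data:

```python
import heapq
from itertools import count

# Hypothetical frequencies; a real table would come from a corpus, with
# characters already absorbed by LZ back-references weighted out.
HTML_FREQS = {' ': 17, 'e': 10, 't': 9, '<': 8, '>': 8, '/': 6, 'a': 6,
              'i': 5, '"': 5, '=': 4, 's': 4, 'o': 3, 'n': 3, 'r': 3}

def build_code(freqs):
    """Return {char: bitstring} for a Huffman code over `freqs`."""
    tiebreak = count()  # makes heap ordering deterministic on both sides
    heap = [(f, next(tiebreak), [c]) for c, f in sorted(freqs.items())]
    heapq.heapify(heap)
    codes = {c: '' for c in freqs}
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for c in group1:                 # prepend a bit as the groups merge
            codes[c] = '0' + codes[c]
        for c in group2:
            codes[c] = '1' + codes[c]
        heapq.heappush(heap, (f1 + f2, next(tiebreak), group1 + group2))
    return codes

CODE = build_code(HTML_FREQS)            # compressor and decompressor both do this
DECODE = {bits: c for c, bits in CODE.items()}

def encode(text):
    return ''.join(CODE[c] for c in text)

def decode(bits):
    out, cur = [], ''
    for b in bits:                        # greedy match works for a prefix-free code
        cur += b
        if cur in DECODE:
            out.append(DECODE[cur])
            cur = ''
    return ''.join(out)

sample = '<a>test</a> '                   # only uses characters present in the table
assert decode(encode(sample)) == sample
```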


Unrelated to the above [educated, I hope] guesses, I started searching the web for information on this topic.

I found this 2008 scholarly paper (PDF format) by Przemysław Skibiński of the University of Wrocław. The paper's abstract indicates a 15% improvement over GZIP, with comparable compression speed.

I may otherwise be looking in the wrong places; there doesn't seem to be much interest in this. It could just be that the additional gain, relative to a plain or moderately tuned generic algorithm, wasn't deemed sufficient to warrant such interest, even in the early days of Web-enabled cell phones (when bandwidth was at quite a premium...).

answered Sep 17 '22 by mjv