
Why does my C# gzip produce a larger file than Fiddler or PHP?

If I GZip this text:

Hello World

through C# using this code:

Stream stream = new MemoryStream(Encoding.Default.GetBytes("Hello World"));
var compressedMemoryStream = new MemoryStream();
using (var gzipStream = new GZipStream(compressedMemoryStream, CompressionMode.Compress))
{
    stream.CopyTo(gzipStream);
    gzipStream.Close();
}

the resulting stream is 133 bytes long.

Running the same string through either Fiddler's Utilities.GzipCompress or this PHP page, the result is only 31 bytes long.

In both cases the input is 11 bytes, so I would imagine the PHP result is correct, but obviously this means that I can't decompress the PHP output from within .NET or vice versa. Why is the .NET output so much larger?


Actually, it turns out that while the results from PHP and Fiddler are the same length, they are not identical. I can decompress the PHP version in .NET, but not the Fiddler version. The PHP page decompresses all three, so it looks like there may be an incompatibility between Fiddler's and .NET's implementations of gzip.


As requested I've uploaded the three outputs to dropbox here.

And these are the raw hexdumps of those files (not sure if they are really any use like this, but I think they show that the difference between the Fiddler and PHP versions is in the header, rather than in the compressed data itself):

Fiddler:

0000-0010:  1f 8b 08 00-c2 e6 ff 4f-00 ff f3 48-cd c9 c9 57  .......O ...H...W
0000-001f:  08 cf 2f ca-49 01 00 56-b1 17 4a 0b-00 00 00     ../.I..V ..J....

PHP:

0000-0010:  1f 8b 08 00-00 00 00 00-00 03 f3 48-cd c9 c9 57  ........ ...H...W
0000-001f:  08 cf 2f ca-49 01 00 56-b1 17 4a 0b-00 00 00     ../.I..V ..J....

C#:

0000-0010:  1f 8b 08 00-00 00 00 00-04 00 ec bd-07 60 1c 49  ........ .....`.I
0000-0020:  96 25 26 2f-6d ca 7b 7f-4a f5 4a d7-e0 74 a1 08  .%&/m.{. J.J..t..
0000-0030:  80 60 13 24-d8 90 40 10-ec c1 88 cd-e6 92 ec 1d  .`.$..@. ........
0000-0040:  69 47 23 29-ab 2a 81 ca-65 56 65 5d-66 16 40 cc  iG#).*.. eVe]f.@.
0000-0050:  ed 9d bc f7-de 7b ef bd-f7 de 7b ef-bd f7 ba 3b  .....{.. ..{....;
0000-0060:  9d 4e 27 f7-df ff 3f 5c-66 64 01 6c-f6 ce 4a da  .N'...?\ fd.l..J.
0000-0070:  c9 9e 21 80-aa c8 1f 3f-7e 7c 1f 3f-22 be 9d 97  ..!....? ~|.?"...
0000-0080:  65 95 7e b7-aa cb d9 ff-13 00 00 ff-ff 56 b1 17  e.~..... .....V..
0000-0085:  4a 0b 00 00-00
Martin Harris asked Jul 11 '12 14:07



2 Answers

Preface: .NET users should not use the Microsoft-provided GZipStream or DeflateStream classes under any circumstances, unless Microsoft replaces them completely with something that works. Use the DotNetZip library instead.

Update to Preface: The .NET Framework 4.5 and later have fixed the compression problem, and GZipStream and DeflateStream use zlib in those versions. I do not know if the CRC problem referenced below has been fixed.

Another update: The CRC problem is not only not fixed, but Microsoft has decided that they won't fix it!

This is one of several bugs in GZipStream. No self-respecting gzip compressor should ever produce 133 bytes of output from 11 bytes of input. See my comments at "Why does BCL GZipStream (with StreamReader) not reliably detect Data Errors with CRC32?".

What is happening internally is that GZipStream is not using the static or stored methods, both of which would produce compressed data about the same size as the input data (on top of which would be added 18 bytes of gzip header and trailer). Instead it is using the dynamic method, which creates a very large code descriptor header for a very small number of codes. It is simply a bug / very bad implementation.
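To make the size claim concrete, here is a minimal sketch using Python's zlib module (an assumption of this example; the original discussion concerns .NET, PHP and Fiddler). It forces a stored block (level 0) and a static block (the Z_FIXED strategy) for the same 11-byte input and reports the raw deflate size, plus the 18 bytes a plain gzip wrapper would add. The helper name raw_deflate is hypothetical.

```python
import zlib

def raw_deflate(data, level, strategy=zlib.Z_DEFAULT_STRATEGY):
    """Compress to a raw deflate stream (no zlib or gzip wrapper)."""
    # wbits=-15 asks zlib for raw deflate output.
    compressor = zlib.compressobj(level=level, wbits=-15, strategy=strategy)
    return compressor.compress(data) + compressor.flush()

data = b"Hello World"

stored = raw_deflate(data, 0)               # level 0 emits stored blocks
static = raw_deflate(data, 9, zlib.Z_FIXED)  # Z_FIXED forbids dynamic blocks

for name, payload in (("stored", stored), ("static", static)):
    # A gzip member with no optional fields adds a 10-byte header
    # and an 8-byte trailer around the deflate data.
    print(name, len(payload), "deflate bytes,", len(payload) + 18, "as gzip")
```

Both variants stay within a few bytes of the input size, which is the behaviour described above for the stored and static methods.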

Update:

With the hex dumps, I can provide some analysis. First, both the Fiddler and PHP outputs are correct and proper. The only difference between them is in the gzip header: the timestamp is set by Fiddler but not by PHP, and the originating operating system is set by PHP but not by Fiddler. In both, the 13 bytes of compressed data are identical, and can be represented as (using my infgen program to disassemble deflate streams):

last
static
literal 'Hello World
end

which is exactly as it should be: a single static block, which requires no code descriptors, simply coding all of the bytes as literals (no matches of previous strings with lengths and distances).
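The disassembly above can be reproduced by hand. The following is a minimal sketch, not a general inflater: it assumes a single static (fixed Huffman) block that contains only literals, which is the case for the 13 bytes shared by the Fiddler and PHP dumps. The function names are hypothetical; the code ranges are those of the fixed Huffman table in RFC 1951.

```python
def bit_stream(data):
    """Yield the bits of data, least significant bit of each byte first."""
    for byte in data:
        for i in range(8):
            yield (byte >> i) & 1

def decode_static_literal_block(data):
    """Decode one static block made only of literals.

    Returns (is_last_block, decoded_bytes).
    """
    bits = bit_stream(data)
    last = next(bits)                        # BFINAL
    btype = next(bits) | (next(bits) << 1)   # BTYPE, two bits, LSB first
    if btype != 1:
        raise ValueError("not a static (fixed Huffman) block")
    out = bytearray()
    while True:
        # Huffman codes are packed most significant bit first.
        code = 0
        for _ in range(7):
            code = (code << 1) | next(bits)
        if code <= 0b0010111:                # 7-bit codes: symbols 256..279
            symbol = 256 + code
        else:
            code = (code << 1) | next(bits)
            if 0x30 <= code <= 0xBF:         # 8-bit codes: literals 0..143
                symbol = code - 0x30
            elif 0xC0 <= code <= 0xC7:       # 8-bit codes: symbols 280..287
                symbol = 280 + code - 0xC0
            else:                            # 9-bit codes: literals 144..255
                code = (code << 1) | next(bits)
                symbol = 144 + code - 0x190
        if symbol == 256:                    # end-of-block
            return bool(last), bytes(out)
        if symbol > 256:
            raise ValueError("length code found; this sketch handles literals only")
        out.append(symbol)

# The 13 bytes of deflate data from the Fiddler and PHP dumps.
payload = bytes.fromhex("f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00")
print(decode_static_literal_block(payload))
```

The three header bits, eleven literal codes and the 7-bit end-of-block code add up to 98 bits, which is why the payload fits in 13 bytes.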

The output of GZipStream on the other hand is a horrible mess in several ways. The compressed data is:

dynamic
code 3 5 code 4 5 code 5 4 code 6 4 code 7 4 code 8 3 code 9 3
code 10 4 code 11 4 code 12 4 code 13 4 code 14 3 code 16 3
litlen 0 14 litlen 1 14 litlen 2 14 litlen 3 14 litlen 4 14 litlen 5 14 litlen 6 14 litlen 7 14
litlen 8 14 litlen 9 12 litlen 10 6 litlen 11 14 litlen 12 14 litlen 13 14 litlen 14 14 litlen 15 14
litlen 16 14 litlen 17 14 litlen 18 14 litlen 19 14 litlen 20 14 litlen 21 14 litlen 22 14 litlen 23 14
litlen 24 14 litlen 25 14 litlen 26 14 litlen 27 14 litlen 28 14 litlen 29 14 litlen 30 13 litlen 31 14
litlen 32 6 litlen 33 14 litlen 34 10 litlen 35 12 litlen 36 14 litlen 37 14 litlen 38 13 litlen 39 10
litlen 40 8 litlen 41 9 litlen 42 11 litlen 43 10 litlen 44 7 litlen 45 8 litlen 46 7 litlen 47 9
litlen 48 8 litlen 49 8 litlen 50 8 litlen 51 9 litlen 52 8 litlen 53 9 litlen 54 10 litlen 55 9
litlen 56 8 litlen 57 9 litlen 58 9 litlen 59 8 litlen 60 9 litlen 61 10 litlen 62 8 litlen 63 14
litlen 64 14 litlen 65 8 litlen 66 9 litlen 67 8 litlen 68 9 litlen 69 8 litlen 70 9 litlen 71 10
litlen 72 11 litlen 73 8 litlen 74 11 litlen 75 14 litlen 76 9 litlen 77 10 litlen 78 9 litlen 79 10
litlen 80 9 litlen 81 12 litlen 82 9 litlen 83 9 litlen 84 9 litlen 85 10 litlen 86 12 litlen 87 11
litlen 88 14 litlen 89 14 litlen 90 12 litlen 91 11 litlen 92 14 litlen 93 11 litlen 94 14 litlen 95 14
litlen 96 14 litlen 97 6 litlen 98 7 litlen 99 7 litlen 100 7 litlen 101 6 litlen 102 8 litlen 103 8
litlen 104 7 litlen 105 6 litlen 106 12 litlen 107 9 litlen 108 6 litlen 109 7 litlen 110 7 litlen 111 6
litlen 112 7 litlen 113 13 litlen 114 6 litlen 115 6 litlen 116 6 litlen 117 7 litlen 118 8 litlen 119 8
litlen 120 9 litlen 121 8 litlen 122 11 litlen 123 13 litlen 124 12 litlen 125 13 litlen 126 13 litlen 127 14
litlen 128 14 litlen 129 14 litlen 130 14 litlen 131 14 litlen 132 14 litlen 133 14 litlen 134 14 litlen 135 14
litlen 136 14 litlen 137 14 litlen 138 14 litlen 139 14 litlen 140 14 litlen 141 14 litlen 142 14 litlen 143 14
litlen 144 14 litlen 145 14 litlen 146 14 litlen 147 14 litlen 148 14 litlen 149 14 litlen 150 14 litlen 151 14
litlen 152 14 litlen 153 14 litlen 154 14 litlen 155 14 litlen 156 14 litlen 157 14 litlen 158 14 litlen 159 14
litlen 160 14 litlen 161 14 litlen 162 14 litlen 163 14 litlen 164 14 litlen 165 14 litlen 166 14 litlen 167 14
litlen 168 14 litlen 169 14 litlen 170 14 litlen 171 14 litlen 172 14 litlen 173 14 litlen 174 14 litlen 175 14
litlen 176 14 litlen 177 14 litlen 178 14 litlen 179 14 litlen 180 14 litlen 181 14 litlen 182 14 litlen 183 14
litlen 184 14 litlen 185 14 litlen 186 14 litlen 187 14 litlen 188 14 litlen 189 14 litlen 190 14 litlen 191 14
litlen 192 14 litlen 193 14 litlen 194 14 litlen 195 14 litlen 196 14 litlen 197 14 litlen 198 14 litlen 199 14
litlen 200 14 litlen 201 14 litlen 202 14 litlen 203 14 litlen 204 14 litlen 205 14 litlen 206 14 litlen 207 14
litlen 208 14 litlen 209 14 litlen 210 14 litlen 211 14 litlen 212 14 litlen 213 14 litlen 214 14 litlen 215 14
litlen 216 14 litlen 217 14 litlen 218 14 litlen 219 14 litlen 220 14 litlen 221 14 litlen 222 14 litlen 223 14
litlen 224 14 litlen 225 14 litlen 226 14 litlen 227 14 litlen 228 14 litlen 229 14 litlen 230 14 litlen 231 14
litlen 232 14 litlen 233 14 litlen 234 14 litlen 235 14 litlen 236 14 litlen 237 14 litlen 238 14 litlen 239 14
litlen 240 14 litlen 241 14 litlen 242 14 litlen 243 13 litlen 244 13 litlen 245 13 litlen 246 14 litlen 247 13
litlen 248 14 litlen 249 13 litlen 250 14 litlen 251 13 litlen 252 14 litlen 253 14 litlen 254 14 litlen 255 14
litlen 256 14 litlen 257 4 litlen 258 3 litlen 259 4 litlen 260 4 litlen 261 4 litlen 262 5 litlen 263 5
litlen 264 5 litlen 265 5 litlen 266 5 litlen 267 6 litlen 268 6 litlen 269 5 litlen 270 6 litlen 271 7
litlen 272 8 litlen 273 8 litlen 274 9 litlen 275 10 litlen 276 9 litlen 277 10 litlen 278 12 litlen 279 11
litlen 280 12 litlen 281 14 litlen 282 14 litlen 283 14 litlen 284 12 litlen 285 11
dist 0 6 dist 1 10 dist 2 11 dist 3 11 dist 4 9 dist 5 8 dist 6 8 dist 7 8
dist 8 7 dist 9 7 dist 10 5 dist 11 6 dist 12 4 dist 13 5 dist 14 4 dist 15 5
dist 16 4 dist 17 5 dist 18 4 dist 19 4 dist 20 4 dist 21 4 dist 22 4 dist 23 4
dist 24 4 dist 25 5 dist 26 4 dist 27 5 dist 28 5 dist 29 5
literal 'Hello World
end
!
last
stored
end

So what is all that? The actual data is just the line near the end "literal 'Hello World", which just codes each byte of the input. What precedes it is a description of a set of Huffman codes for literals, lengths, and distances. Here are the things wrong with it:

  • First off, it should not be using dynamic at all. Describing the set of codes takes about 100 bytes. This is precisely why the deflate format provides a pre-defined set of codes used in static blocks. The compressor should select a static block in this case (which is what php and Fiddler are doing).
  • Second, every single possible code is defined, even though the vast majority are never used! When using a dynamic block, a proper compressor will only define codes for literals, lengths, and distances actually used in that block. In this case there are no lengths or distances used, and only eight different literals used (H, e, l, o, space, W, r, and d). Instead it proceeds to define 256 literal codes, 29 length codes, and 30 distance codes. I am guessing that some experimentation will show that the dynamic header from GZipStream is always the same, in which case it's not even dynamic, which is the whole point!
  • Third, it throws in an unnecessary empty stored block at the end. The first block should have been marked as the last block.
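The code counts in the second point can be read straight from the first three bytes of the GZipStream deflate data (ec bd 07 in the C# dump, right after the 10-byte gzip header). A minimal sketch, assuming the field layout of a dynamic block header from RFC 1951 (BFINAL, BTYPE, HLIT, HDIST, HCLEN); the function names are hypothetical:

```python
def read_bits(data, pos, count):
    """Read count bits starting at bit offset pos, LSB first.

    Returns (value, new_bit_offset).
    """
    value = 0
    for i in range(count):
        byte = data[(pos + i) // 8]
        value |= ((byte >> ((pos + i) % 8)) & 1) << i
    return value, pos + count

def dynamic_header_counts(deflate_start):
    """Return how many codes a dynamic block header announces."""
    pos = 0
    bfinal, pos = read_bits(deflate_start, pos, 1)
    btype, pos = read_bits(deflate_start, pos, 2)
    if btype != 2:
        raise ValueError("not a dynamic block")
    hlit, pos = read_bits(deflate_start, pos, 5)
    hdist, pos = read_bits(deflate_start, pos, 5)
    hclen, pos = read_bits(deflate_start, pos, 4)
    return {
        "last": bool(bfinal),
        "litlen_codes": hlit + 257,        # literals, end-of-block, lengths
        "dist_codes": hdist + 1,
        "code_length_codes": hclen + 4,
    }

counts = dynamic_header_counts(bytes.fromhex("ec bd 07"))
print(counts)

# Symbols the block actually needs: the distinct literals plus end-of-block.
print(len(set(b"Hello World")) + 1)
```

The header announces 286 literal/length codes and 30 distance codes, while the block only needs the eight distinct literals and the end-of-block symbol. It also shows the block is not marked as last, which is why the extra empty block follows.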

All of this points to the simple fact that whoever wrote this GZipStream code was, to put it as politely as I can, lacking in any understanding of the deflate format or compression in general. They elected to produce only dynamic blocks (except for an empty stored block at the end), to only produce the same dynamic header every time (I think), defeating the purpose of dynamic blocks, and to not bother to figure out if the current block is the last one, requiring putting out an empty block to mark the end.

As noted elsewhere, those aren't the only problems with GZipStream. It can't even properly use the CRC-32 as intended to detect corrupt streams.
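For comparison, this is roughly how a decompressor is expected to behave when the CRC-32 in the gzip trailer does not match the data. The sketch uses Python's zlib module (an assumption of this example, not the .NET API) and flips one bit of the stored CRC; the helper names are hypothetical.

```python
import zlib

def gzip_bytes(data):
    """Produce a gzip member with zlib (wbits=31 selects the gzip wrapper)."""
    compressor = zlib.compressobj(wbits=31)
    return compressor.compress(data) + compressor.flush()

def crc_detects_corruption(member):
    """Flip one bit of the stored CRC-32 and report whether decoding fails."""
    damaged = bytearray(member)
    damaged[-8] ^= 0x01   # the trailer is CRC-32 (4 bytes) then ISIZE (4 bytes)
    try:
        zlib.decompress(bytes(damaged), wbits=31)
    except zlib.error:
        return True       # the mismatch was reported
    return False          # the corruption went unnoticed

member = gzip_bytes(b"Hello World")
print(crc_detects_corruption(member))
```

A conforming decoder computes the CRC-32 of the decompressed data and refuses the stream when it differs from the value in the trailer.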

The truly perplexing thing is not why Microsoft assigned someone incompetent to write a gzip compressor and decompressor, but rather why they assigned anyone at all to write it! There is freely available code, zlib, that has an extremely liberal license that permits commercial use with no attribution. This code has been deployed widely for almost two decades, and does all the things it's supposed to do correctly and efficiently. Most everything else uses zlib, including php and I suspect Fiddler as well.

Mark Adler answered Sep 25 '22 15:09


GZipStream adds a 10-byte header and an 8-byte footer to the compressed data, as described in the RFC 1952 specification. This gives a result that is 133 bytes long.
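The header and footer fields can be inspected directly. Here is a minimal sketch that unpacks the fixed 10-byte header and 8-byte trailer of the Fiddler and PHP dumps from the question. It assumes no optional header fields are present (FLG is 0 in all three dumps); the function name parse_member is hypothetical.

```python
import struct
import zlib

FIDDLER = bytes.fromhex(
    "1f 8b 08 00 c2 e6 ff 4f 00 ff f3 48 cd c9 c9 57"
    "08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00")
PHP = bytes.fromhex(
    "1f 8b 08 00 00 00 00 00 00 03 f3 48 cd c9 c9 57"
    "08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00")

def parse_member(member):
    """Split a gzip member without optional fields into its parts."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack(
        "<HBBIBB", member[:10])
    if magic != 0x8B1F or flags != 0:
        raise ValueError("not a plain gzip member")
    crc32, isize = struct.unpack("<II", member[-8:])
    return {
        "mtime": mtime,            # seconds since the Unix epoch, 0 if unset
        "xfl": xfl,
        "os": os_code,             # 3 = Unix, 255 = unknown
        "payload": member[10:-8],  # the raw deflate data
        "crc32": crc32,
        "isize": isize,            # uncompressed length modulo 2**32
    }

fiddler, php = parse_member(FIDDLER), parse_member(PHP)
print(fiddler["mtime"], fiddler["os"], php["mtime"], php["os"])
print(fiddler["payload"] == php["payload"], len(php["payload"]))
```

Only the timestamp and operating system bytes differ between the two files; the deflate payload and the trailer are the same.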

The PHP page you linked to also adds the same 18-byte header/footer if asked to (GZIP-compatible encoding?). If you use that it gives a result that is 31 bytes long.

Without the header/footer, the difference between them is 115 versus 13 bytes.
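As a quick check of that subtraction, a small sketch (assuming the fixed 10-byte header and 8-byte footer with no optional fields; the function name is hypothetical):

```python
GZIP_HEADER_BYTES = 10   # fixed header when no optional fields are present
GZIP_TRAILER_BYTES = 8   # CRC-32 plus ISIZE

def deflate_payload_size(member_size):
    """Size of the deflate data inside a plain gzip member."""
    return member_size - GZIP_HEADER_BYTES - GZIP_TRAILER_BYTES

print(deflate_payload_size(133))  # the .NET output
print(deflate_payload_size(31))   # the PHP and Fiddler outputs
```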

Henrik Ripa answered Sep 25 '22 15:09