I am trying to download a .gz file of a few hundred MBs, and turn it into a very long string in C#.
using (var memstream = new MemoryStream(new WebClient().DownloadData(url)))
using (GZipStream gs = new GZipStream(memstream, CompressionMode.Decompress))
using (var outmemstream = new MemoryStream())
{
    gs.CopyTo(outmemstream);
    string t = Encoding.UTF8.GetString(outmemstream.ToArray());
    Console.WriteLine(t);
}
My test URL: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-47/segments/1510934803848.60/wat/CC-MAIN-20171117170336-20171117190336-00002.warc.wat.gz
memstream has a length of 283063949. The program lingers for about 15 seconds on the line where it is initialized, and my network is floored during it, which makes sense.
outmemstream has a length of only 548.
Written to the command line are the first few lines of the zipped document. They are not garbled. I'm not sure how to get the rest.
The .NET GZipStream unpacks the first 548 bytes of the plain text, which is all of the first record in the file. 7Zip extracts the whole file to a 1.2 GB output file, but it is plain text (about 1.3 million lines worth) with no record separators, and when I test the file in 7Zip it reports 1,441 bytes.
I checked a few things and couldn't find a single compression library that would unpack this thing directly.
After a bit of casting about in the file I found that 1,441 bytes is the value of ISIZE, which is normally the last 4 bytes of the gzip file, part of an 8-byte footer record that is appended to the compressed data chunks.
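As a quick sanity check you can read that footer field yourself. This is only a sketch, assuming the file has already been downloaded locally: ISIZE is the little-endian 32-bit value in the last 4 bytes of a gzip stream, the uncompressed size modulo 2^32, and in a concatenated file it only describes the final member.

byte[] raw = File.ReadAllBytes("CC-MAIN-20171117170336-20171117190336-00002.warc.wat.gz");
// ISIZE: uncompressed size of the *last* member, modulo 2^32, little-endian
// (BitConverter matches that layout on the usual little-endian machines)
uint isize = BitConverter.ToUInt32(raw, raw.Length - 4);
Console.WriteLine(isize);   // 1441 for this file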
It turns out that what you have is a big set of .gz files concatenated together. And while that's a complete pain in the butt, there are a few ways you can approach this.
The first is to scan the compressed file for the gzip header signature bytes: 0x1F and 0x8B. When you locate these you will (usually) have the start of each .gz file in the stream. You can build a list of offsets in the file and then extract each chunk of the file and decompress it.
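A rough sketch of that scan, assuming the file fits in memory (FindGZipOffsets is just an illustrative name, not a library call; checking the third byte, 0x08 for deflate, cuts down on false positives but does not eliminate them):

public static List<long> FindGZipOffsets(string filename)
{
    var offsets = new List<long>();
    byte[] data = File.ReadAllBytes(filename);
    for (int i = 0; i < data.Length - 2; i++)
    {
        // GZip member header starts with ID1 = 0x1F, ID2 = 0x8B, CM = 0x08 (deflate)
        if (data[i] == 0x1F && data[i + 1] == 0x8B && data[i + 2] == 0x08)
            offsets.Add(i);
    }
    return offsets;
}

Each offset is a candidate member start; you would then read from one offset to the next and run each slice through a decompressor.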
Another option is to use a library that will report the number of bytes consumed from the input stream. Since almost all decompressors use buffering of some sort, you will find that the input stream will move much further than the number of bytes consumed, so this is difficult to guess at directly. The DotNetZip streams, however, will give you the actual consumed input bytes, which you can use to figure out the next starting position. This will allow you to process the file as a stream and extract each file individually.
Either way, not fast.
Here's a method for the second option, using the DotNetZip library:
public static IEnumerable<byte[]> UnpackCompositeFile(string filename)
{
    using (var fstream = File.OpenRead(filename))
    {
        long offset = 0;
        while (offset < fstream.Length)
        {
            fstream.Position = offset;
            byte[] bytes = null;
            using (var ms = new MemoryStream())
            using (var unpack = new Ionic.Zlib.GZipStream(fstream, Ionic.Zlib.CompressionMode.Decompress, true))
            {
                unpack.CopyTo(ms);
                bytes = ms.ToArray();
                // Total compressed bytes read, plus 10 for GZip header, plus 8 for GZip footer
                offset += unpack.TotalIn + 18;
            }
            yield return bytes;
        }
    }
}
It's ugly and not fast (took me about 48 seconds to decompress the whole file) but it appears to work. Each byte[] output represents a single compressed file in the stream. These can be turned into strings with System.Text.Encoding.UTF8.GetString(...) and then parsed to extract the meaning.
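Usage is straightforward; for example (the file name is just a placeholder for wherever you saved the download):

foreach (byte[] record in UnpackCompositeFile("CC-MAIN-20171117170336-20171117190336-00002.warc.wat.gz"))
{
    string text = System.Text.Encoding.UTF8.GetString(record);
    // parse the WAT/WARC record here
}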
The last item in the file looks like this:
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: https://zverek-shop.ru/dljasobak/ruletka_sobaki/ruletka-tros_standard_5_m_dlya_sobak_do_20_kg
WARC-Date: 2017-11-25T14:16:01Z
WARC-Record-ID: <urn:uuid:e19ef645-b057-4305-819f-7be2687c3f19>
WARC-Refers-To: <urn:uuid:df5de410-d4af-45ce-b545-c699e535765f>
Content-Type: application/json
Content-Length: 1075
{"Container":{"Filename":"CC-MAIN-20171117170336-20171117190336-00002.warc.gz","Compressed":true,"Offset":"904209205","Gzip-Metadata":{"Inflated-Length":"463","Footer-Length":"8","Inflated-CRC":"1610542914","Deflate-Length":"335","Header-Length":"10"}},"Envelope":{"Format":"WARC","WARC-Header-Length":"438","Actual-Content-Length":"21","WARC-Header-Metadata":{"WARC-Target-URI":"https://zverek-shop.ru/dljasobak/ruletka_sobaki/ruletka-tros_standard_5_m_dlya_sobak_do_20_kg","WARC-Warcinfo-ID":"<urn:uuid:283e4862-166e-424c-b8fd-023bfb4f18f2>","WARC-Concurrent-To":"<urn:uuid:ca594c00-269b-4690-b514-f2bfc39c2d69>","WARC-Date":"2017-11-17T17:43:04Z","Content-Length":"21","WARC-Record-ID":"<urn:uuid:df5de410-d4af-45ce-b545-c699e535765f>","WARC-Type":"metadata","Content-Type":"application/warc-fields"},"Block-Digest":"sha1:4SKCIFKJX5QWLVICLR5Y2BYE6IBVMO3Z","Payload-Metadata":{"Actual-Content-Type":"application/metadata-fields","WARC-Metadata-Metadata":{"Metadata-Records":[{"Value":"1140","Name":"fetchTimeMs"}]},"Actual-Content-Length":"21","Trailing-Slop-Length":"0"}}}
This is the record that occupies 1,441 bytes, including the two blank lines after it.
Just for the sake of completeness...
The TotalIn property returns the number of compressed bytes read, not including the GZip header and footer. In the code above I use a constant 18 bytes for the header and footer size, which is the minimum size of these for GZip. While that works for this file, anyone else dealing with concatenated GZip files may find that there is additional data in the header that makes it larger, which will stop the above from working.
In this case you have two options:

- Parse the GZip header fields yourself to find where the compressed data actually starts, then use DeflateStream to decompress (a sketch of the header walk follows below).
- Scan forward for the next GZip signature starting at TotalIn + 18 bytes.

Either should work without slowing you down too much. Since buffering is happening in the decompression code you're going to have to seek the stream backwards after each segment anyway, so reading some additional bytes doesn't cost much.
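For the first option, the header layout is spelled out in RFC 1952: a 10-byte fixed part followed by optional extra/name/comment/CRC fields signalled by the FLG byte. A minimal sketch (SkipGZipHeader is an illustrative name, not part of any library) might look like the following; after it returns you can decompress the raw deflate data and advance the offset by the returned header length, plus the consumed deflate bytes (TotalIn on the DotNetZip streams), plus 8 footer bytes.

static long SkipGZipHeader(Stream s)
{
    long start = s.Position;
    var fixedPart = new byte[10];
    if (s.Read(fixedPart, 0, 10) != 10 || fixedPart[0] != 0x1F || fixedPart[1] != 0x8B)
        throw new InvalidDataException("Not a GZip member");
    byte flags = fixedPart[3];
    if ((flags & 0x04) != 0)                                 // FEXTRA: 2-byte little-endian length, then payload
    {
        int xlen = s.ReadByte() | (s.ReadByte() << 8);
        s.Seek(xlen, SeekOrigin.Current);
    }
    if ((flags & 0x08) != 0) while (s.ReadByte() > 0) { }    // FNAME: zero-terminated string
    if ((flags & 0x10) != 0) while (s.ReadByte() > 0) { }    // FCOMMENT: zero-terminated string
    if ((flags & 0x02) != 0) s.Seek(2, SeekOrigin.Current);  // FHCRC: 2-byte header CRC
    return s.Position - start;                               // actual header length for this member
}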
That is a valid gzip stream, decompressible by gzip. Per the standard (RFC 1952), a concatenation of valid gzip streams is also a valid gzip stream. Your file is a concatenation of 118,644 (!) atomic gzip streams. The first atomic gzip stream is 382 bytes long, and results in 548 uncompressed bytes. That's all you're getting.
Apparently the GZipStream class has a bug in that it does not look for another atomic gzip stream after it completes the decompression of the first one, and so is not abiding by RFC 1952. You can just do that yourself in a loop, until you reach the end of the input file.
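If you want to see that behaviour in isolation, here is a small self-contained sketch (made-up strings, nothing to do with the Common Crawl file; it assumes using directives for System, System.IO, System.IO.Compression, System.Linq and System.Text). Two members concatenated together are a valid gzip stream, yet GZipStream on .NET Framework stops after the first one.

static byte[] GZip(string text)
{
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
        {
            byte[] data = Encoding.UTF8.GetBytes(text);
            gz.Write(data, 0, data.Length);
        }
        return ms.ToArray();
    }
}

static void Main()
{
    // A valid gzip stream built from two concatenated members
    byte[] combined = GZip("first member\n").Concat(GZip("second member\n")).ToArray();

    using (var input = new MemoryStream(combined))
    using (var gz = new GZipStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
        gz.CopyTo(output);
        // Prints only "first member" on .NET Framework; the second member is silently ignored.
        Console.WriteLine(Encoding.UTF8.GetString(output.ToArray()));
    }
}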
As a side note, the small size of each gzip stream in your file is rather inefficient. The compressor needs more data than that to get rolling. If that data is compressed as a single atomic gzip stream, it compresses to 195,606,385 bytes instead of 283,063,949 bytes. It would compress to about the same size even with many pieces, so long as the pieces were more like a megabyte in size or more, as opposed to the average of about 10K bytes per piece you have there.