
How to read from file containing multiple GzipStreams

I've got a file created with code which looks like this:

        using (var fs=File.OpenWrite("tmp"))
        {
            using (GZipStream gs=new GZipStream(fs,CompressionMode.Compress,true))
            {
                using (StreamWriter sw=new StreamWriter(gs))
                {
                    sw.WriteLine("hello ");
                }
            }

            using (GZipStream gs = new GZipStream(fs, CompressionMode.Compress, true))
            {
                using (StreamWriter sw = new StreamWriter(gs))
                {
                    sw.WriteLine("world");
                }
            }
        }

Now I'm trying to read the data from this file with the following code:

        string txt;

        using (var fs=File.OpenRead("tmp"))
        {
            using (GZipStream gs=new GZipStream(fs,CompressionMode.Decompress,true))
            {
                using (var rdr = new StreamReader(gs))
                {
                    txt = rdr.ReadToEnd();
                }
            }

            using (GZipStream gs = new GZipStream(fs, CompressionMode.Decompress, true))
            {
                using (StreamReader sr = new StreamReader(gs))
                {
                    txt+=sr.ReadToEnd();
                }
            }
        }

The first stream is read correctly, but the second one is not.

How can I read the second stream?

Arsen Zahray asked Mar 07 '13 18:03


1 Answer

This is a problem with the way GZipStream handles gzip files that contain multiple gzip members. It reads the first member and treats everything after it as garbage (interestingly, utilities like gzip and WinZip handle such files correctly by extracting all members into one file). There are a couple of workarounds, or you can use a third-party library such as DotNetZip (http://dotnetzip.codeplex.com/).

Perhaps the easiest workaround is to scan the file for all of the gzip headers, then manually move the stream to each one and decompress its content. You can find headers by looking in the raw file bytes for ID1 (0x1F) and ID2 (0x8B) followed by 0x08 (the Deflate compression method; see the specification: http://www.gzip.org/zlib/rfc-gzip.html). Those three bytes can also occur by chance inside compressed data, so they don't guarantee that you're looking at a gzip header; you should read the rest of the fixed header (the first ten bytes) to verify:

    const int Id1 = 0x1F;
    const int Id2 = 0x8B;
    const int DeflateCompression = 0x8;
    const int GzipFooterLength = 8;
    const int MaxGzipFlag = 32; 

    /// <summary>
    /// Returns true if the stream could be a valid gzip header at the current position.
    /// </summary>
    /// <param name="stream">The stream to check.</param>
    /// <returns>Returns true if the stream could be a valid gzip header at the current position.</returns>
    public static bool IsHeaderCandidate(Stream stream)
    {
        // Read the first ten bytes of the stream
        byte[] header = new byte[10];

        int bytesRead = stream.Read(header, 0, header.Length);
        stream.Seek(-bytesRead, SeekOrigin.Current);

        if (bytesRead < header.Length)
        {
            return false;
        }

        // Check the id tokens and compression algorithm
        if (header[0] != Id1 || header[1] != Id2 || header[2] != DeflateCompression)
        {
            return false;
        }

        // Check the GZIP flags: only the low 5 bits are defined (2 pow. 5 = 32), so valid values are 0 to 31
        if (header[3] >= MaxGzipFlag)
        {
            return false;
        }

        // Check the extra flags (XFL): 0 (not set), 2 (maximum compression) or 4 (fastest) with the Deflate algorithm
        if (header[8] != 0x0 && header[8] != 0x2 && header[8] != 0x4)
        {
            return false;
        }

        return true;
    }
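
As a rough sketch of how the check above could drive the scan, the helper below walks a seekable stream one byte at a time and collects every position the check accepts. `GzipScanner` and `FindCandidateOffsets` are hypothetical names, not part of any library, and the sketch assumes the predicate restores the stream position before returning, as `IsHeaderCandidate` does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class GzipScanner
{
    // Returns every position at which the predicate reports a possible gzip header.
    // Assumption: the predicate leaves the stream position where it found it.
    public static List<long> FindCandidateOffsets(Stream stream, Func<Stream, bool> isHeaderCandidate)
    {
        if (!stream.CanSeek)
        {
            throw new ArgumentException("A seekable stream is required.", "stream");
        }

        var offsets = new List<long>();

        for (long position = 0; position < stream.Length; position++)
        {
            // Reset explicitly so the scan does not depend on the predicate's seek behaviour
            stream.Position = position;

            if (isHeaderCandidate(stream))
            {
                offsets.Add(position);
            }
        }

        return offsets;
    }
}
```

A call such as `GzipScanner.FindCandidateOffsets(fs, IsHeaderCandidate)` would return the candidate offsets for the file stream. Seeking on every byte is slow for large files, so a real implementation would probably scan a buffer instead, and the result can still contain false positives.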

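Note that GZipStream reads from the underlying stream in buffered chunks, so if you hand it the file stream directly it may leave the stream positioned past the start of the next member, possibly at the end of the file. You may want to read each part into a MemoryStream and then decompress each part individually in memory.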
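
Here is a minimal sketch of that in-memory approach, assuming you already have the start offset of every member (for instance from a header scan). `GzipSegments` and `DecompressSegments` are hypothetical names, and the sketch assumes each offset really is a member start; a false positive from the scan would split a member and break decompression.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSegments
{
    // Copies each [offset, next offset) range into memory and decompresses it on its own,
    // so GZipStream's read-ahead never touches the following member.
    public static string DecompressSegments(Stream file, IList<long> offsets)
    {
        var text = new StringBuilder();

        for (int i = 0; i < offsets.Count; i++)
        {
            long end = i + 1 < offsets.Count ? offsets[i + 1] : file.Length;
            var segment = new byte[end - offsets[i]];

            // Stream.Read may return fewer bytes than requested, so loop until full
            file.Position = offsets[i];
            int filled = 0;
            while (filled < segment.Length)
            {
                int read = file.Read(segment, filled, segment.Length - filled);
                if (read == 0)
                {
                    throw new EndOfStreamException();
                }
                filled += read;
            }

            using (var memory = new MemoryStream(segment))
            using (var gzip = new GZipStream(memory, CompressionMode.Decompress))
            using (var reader = new StreamReader(gzip))
            {
                text.Append(reader.ReadToEnd());
            }
        }

        return text.ToString();
    }
}
```

Each member gets its own GZipStream over its own MemoryStream, so it does not matter how far the decompressor reads ahead.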

An alternative approach would be to modify the gzip headers to record the length of each member's content (the specification's optional extra field is one place for this), so that you can compute the offset of each member programmatically instead of scanning the file for headers. This requires diving a bit deeper into the gzip spec.
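
One way this could look, purely as a sketch: compress each piece into memory, set the FEXTRA flag, and insert an extra-field subfield that holds the total member length. The subfield ID `ML`, the 4-byte length payload, and the names `LengthTaggedGzip`, `WriteMember` and `ReadAll` are all assumptions made for illustration. The sketch also assumes GZipStream writes a bare ten-byte header with no optional fields, and that the tagged subfield is the only one present.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class LengthTaggedGzip
{
    const int FixedHeaderLength = 10; // ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS
    const byte FlagExtra = 0x04;      // FLG.FEXTRA bit from the gzip specification
    const int ExtraLength = 8;        // SI1, SI2, LEN (2 bytes), 4-byte payload
    const byte Si1 = (byte)'M';       // hypothetical subfield ID, not a registered one
    const byte Si2 = (byte)'L';

    // Writes text as one gzip member whose extra field stores the member's total length.
    public static void WriteMember(Stream output, string text)
    {
        byte[] member;
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            member = buffer.ToArray();
        }

        // Assumption: the compressor emitted no optional header fields
        if (member[3] != 0) throw new InvalidDataException("Unexpected gzip flags.");

        int total = member.Length + 2 + ExtraLength; // adds XLEN (2 bytes) and the subfield
        member[3] = FlagExtra;
        byte[] extra =
        {
            ExtraLength, 0,  // XLEN, little-endian
            Si1, Si2, 4, 0,  // subfield ID and payload length
            (byte)total, (byte)(total >> 8), (byte)(total >> 16), (byte)(total >> 24)
        };

        output.Write(member, 0, FixedHeaderLength);
        output.Write(extra, 0, extra.Length);
        output.Write(member, FixedHeaderLength, member.Length - FixedHeaderLength);
    }

    // Reads consecutive tagged members, using the stored length to find each boundary.
    public static string ReadAll(Stream input)
    {
        var text = new StringBuilder();
        var prefix = new byte[FixedHeaderLength + 2 + ExtraLength];

        while (true)
        {
            int got = Fill(input, prefix, 0, prefix.Length);
            if (got == 0) break; // clean end of input
            if (got < prefix.Length || prefix[0] != 0x1F || prefix[1] != 0x8B
                || (prefix[3] & FlagExtra) == 0 || prefix[12] != Si1 || prefix[13] != Si2)
                throw new InvalidDataException("Not a length-tagged gzip member.");

            int total = prefix[16] | prefix[17] << 8 | prefix[18] << 16 | prefix[19] << 24;
            if (total < prefix.Length) throw new InvalidDataException("Bad member length.");

            var member = new byte[total];
            Array.Copy(prefix, member, prefix.Length);
            int rest = total - prefix.Length;
            if (Fill(input, member, prefix.Length, rest) < rest) throw new EndOfStreamException();

            using (var gzip = new GZipStream(new MemoryStream(member), CompressionMode.Decompress))
            using (var reader = new StreamReader(gzip, Encoding.UTF8))
            {
                text.Append(reader.ReadToEnd());
            }
        }
        return text.ToString();
    }

    // Reads until count bytes are filled or the stream ends; returns the bytes read.
    static int Fill(Stream stream, byte[] buffer, int offset, int count)
    {
        int filled = 0;
        while (filled < count)
        {
            int read = stream.Read(buffer, offset + filled, count - filled);
            if (read == 0) break;
            filled += read;
        }
        return filled;
    }
}
```

The trade-off is that the file must be written by code you control. Members should still be readable by ordinary gzip tools, which are expected to skip extra fields they do not recognise.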

Jacob answered Sep 29 '22 07:09