I have a gzip file containing a text file that needs to be cleaned up. I would like to read the gzipped file line by line and write the cleaned content to an output gzip file in a single pass, like this:
    void ExtractAndFix(string inputPath, string outputPath) {
        StringBuilder sbLine = new StringBuilder();
        using (GZipStream gzInput = new GZipStream(new FileStream(inputPath, FileMode.Open), System.IO.Compression.CompressionMode.Decompress)) {
            using (StreamReader reader = new StreamReader(gzInput, Encoding.UTF8)) {
                using (GZipOutputStream gzipWriter = new GZipOutputStream(new FileStream(outputPath, FileMode.Create))) {
                    string line = null;
                    while ((line = reader.ReadLine()) != null) {
                        sbLine.Clear();
                        sbLine.Append(line.Replace("\t", " "));
                        sbLine.Append("\r\n");
                        byte[] bytes = Encoding.UTF8.GetBytes(sbLine.ToString());
                        gzipWriter.Write(bytes, 0, bytes.Length);
                    }
                }
            }
        }
    }
But for some reason the call to reader.ReadLine() in the while loop reads only once and then returns null (reader.EndOfStream is true). I've tried this with both the native .NET compression library and the ICSharpCode package, and I get the same behavior. I realize I could always extract the full file, clean it, and then re-compress it, but I hate having to waste the resources, hard drive space, etc. Note: these are large files (up to several GB compressed), so anything with MemoryStream is not going to be a good solution. Has anyone encountered anything odd like this before? Thank you.
After a lot of hair pulling I appear to have found the issue. For me the problem was further compounded by the fact that certain GZip files would work fine while others would display the behavior above. For example, if I created the archive myself with GZip it would work great, but certain other archives generated from other sources would not.
In short, the .NET GZip library is garbage; don't use it. In addition, the ICSharpCode library I was using was a couple of years old. I'm not sure whether it used to piggyback on the underlying .NET code, but the version I had previously (0.85.4) showed exactly the same behavior. When I upgraded to the latest version (0.86.0), it worked and I was able to read the full file as expected.
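For anyone who wants to try the same approach, here is a minimal sketch of the method using SharpZipLib for both reading and writing. It assumes the ICSharpCode.SharpZipLib package (0.86.0 or later) is referenced. The class name GzipCleaner is made up for illustration, and this is not necessarily the exact code I ended up with. It streams line by line, so the whole file is never held in memory.

```csharp
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.GZip;

// Hypothetical helper class; the name is only for illustration.
static class GzipCleaner
{
    // Streams inputPath to outputPath one line at a time,
    // replacing tabs with spaces and writing CRLF line endings.
    public static void ExtractAndFix(string inputPath, string outputPath)
    {
        using (var gzInput = new GZipInputStream(File.OpenRead(inputPath)))
        using (var reader = new StreamReader(gzInput, Encoding.UTF8))
        using (var gzOutput = new GZipOutputStream(File.Create(outputPath)))
        // UTF8Encoding(false) avoids writing a byte order mark,
        // matching the Encoding.UTF8.GetBytes output in the question.
        using (var writer = new StreamWriter(gzOutput, new UTF8Encoding(false)))
        {
            writer.NewLine = "\r\n";
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line.Replace("\t", " "));
            }
        }
    }
}
```

Wrapping the output stream in a StreamWriter replaces the StringBuilder and manual byte conversion from the question. Disposing the writer flushes it and closes the gzip stream underneath, which is what writes the gzip trailer.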
Hopefully this helps someone else with the same issue.