Recover corrupt zip or gzip files?

The most common way to corrupt compressed files is an inadvertent ASCII-mode FTP transfer, which causes a many-to-one trashing of CR and/or LF characters.

Obviously, there is information loss, and the best way to fix this problem is to transfer again, in FTP binary mode.

However, if the original is lost, and it's important, how recoverable is the data?

[Actually, I already know what I think is the best answer (it's very difficult but sometimes possible - I'll post more later), and the common non-answers (lots of off-the-shelf programs for repairing CRCs without repairing data), but I thought it would be interesting to try out this question during the stackoverflow beta period, and see if anyone else has gone down the successful-recovery path or discovered tools I don't know about.]

asked Sep 12 '08 18:09

Liudvikas Bukys

2 Answers

From Bukys Software

Approximately 1 in 256 bytes is known to be corrupted, and the corruption is known to occur only in bytes with the value '\012'. So the byte error rate is 1/256 (0.39% of input), and 2/256 bytes (0.78% of input) are suspect. But since only three bits per smashed byte are affected, the bit error rate is only 3/(256*8): 0.15% is bad, 0.29% is suspect.
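The percentages in the passage above can be checked with a few lines of arithmetic. This is only a restatement of the quoted figures ('\012' is octal for the LF byte); the "three bits per smashed byte" value is taken from the quote as given, and the variable names are made up for illustration.

```python
# Re-deriving the error rates quoted above. All inputs come from the
# quoted passage; nothing here is measured from a real file.
BYTE_VALUES = 256           # a damaged byte is one particular value out of 256
BITS_PER_BYTE = 8
DAMAGED_BITS_PER_BYTE = 3   # figure stated in the quote, taken on trust

byte_error_rate = 1 / BYTE_VALUES
byte_suspect_rate = 2 / BYTE_VALUES
bit_error_rate = DAMAGED_BITS_PER_BYTE / (BYTE_VALUES * BITS_PER_BYTE)
bit_suspect_rate = 2 * bit_error_rate


def pct(rate: float) -> str:
    """Format a fraction as a percentage with two decimals."""
    return f"{100 * rate:.2f}%"


print("bytes bad:    ", pct(byte_error_rate))
print("bytes suspect:", pct(byte_suspect_rate))
print("bits bad:     ", pct(bit_error_rate))
print("bits suspect: ", pct(bit_suspect_rate))
```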

...

An error in the compressed input disrupts the decompression process for all subsequent bytes... The fact that the decompressed output is recognizably bad so quickly is cause for hope -- a search for the correct answer can identify wrong answers quickly.
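A rough way to see this effect is to damage a stream on purpose and watch where decompression goes wrong. The sketch below uses Python's zlib module on a made-up log-like payload and drops a single byte from the middle of the compressed data. The payload, the helper names, and the choice of which byte to drop are all assumptions for illustration, not anything from the original files.

```python
import zlib


def drop_byte(stream: bytes, index: int) -> bytes:
    """Remove one byte, standing in for a two-byte pair collapsed to one."""
    return stream[:index] + stream[index + 1:]


def decompress_until_failure(stream: bytes):
    """Feed the stream to zlib one byte at a time.

    Returns (output, offset). offset is the input position at which zlib
    rejected the data, len(stream) if the input ran out before the end of
    the stream, or None if everything decoded and passed the checksum.
    """
    decoder = zlib.decompressobj()
    output = bytearray()
    for offset in range(len(stream)):
        try:
            output += decoder.decompress(stream[offset:offset + 1])
        except zlib.error:
            return bytes(output), offset
    if not decoder.eof:
        return bytes(output), len(stream)
    return bytes(output), None


def matching_prefix(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Hypothetical payload: 500 short log lines.
original = b"".join(
    b"2008-09-12 18:%02d:00 request %d ok\n" % (i % 60, i) for i in range(500)
)
clean = zlib.compress(original)
damage_at = len(clean) // 2
damaged = drop_byte(clean, damage_at)

output, failed_at = decompress_until_failure(damaged)
print("byte dropped at compressed offset:", damage_at)
print("zlib gave up at compressed offset:", failed_at)
print("output bytes still correct:", matching_prefix(output, original),
      "of", len(original))
```

Comparing the number of still-correct output bytes with the total shows how much of the file survives up to the damage, and the gap between the two offsets shows how long the decoder ran on bad input before noticing.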

Ultimately, several techniques were combined to successfully extract reasonable data from these files:

Domain-specific parsing of fields and quoted strings

Machine learning from previous data with low probability of damage

Tolerance for file damage due to other causes (e.g. disk full while logging)

Lookahead for guiding the search along the highest-probability paths

These techniques identify 75% of the necessary repairs with certainty, and the remainder are explored highest-probability-first, so that plausible reconstructions are identified immediately.
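The "highest-probability-first" exploration can be sketched as a best-first enumeration over sets of repairs. This is not the author's implementation: it assumes each suspect position already has an independent probability of needing repair (the numbers in the usage note are invented), and it only shows the ordering step, not the domain-specific scoring or the decompression check.

```python
import heapq
import math


def repair_sets_by_probability(p_damaged):
    """Yield (probability, positions_to_repair), most probable first.

    p_damaged[i] is an assumed probability, strictly between 0 and 1, that
    suspect position i needs repair. Positions are treated as independent.
    """
    n = len(p_damaged)
    # Most likely decision for every position, and its joint log-probability.
    likely = frozenset(i for i, p in enumerate(p_damaged) if p > 0.5)
    base = sum(math.log(max(p, 1.0 - p)) for p in p_damaged)
    # Cost (absolute log-odds) of overturning each decision, cheapest first.
    flips = sorted(
        (abs(math.log(p) - math.log(1.0 - p)), i)
        for i, p in enumerate(p_damaged)
    )
    cost = [c for c, _ in flips]
    pos = [i for _, i in flips]

    yield math.exp(base), likely
    # Heap entries: (total cost, rank of last flip, ranks of all flips).
    heap = [(cost[0], 0, (0,))] if n else []
    while heap:
        total, last, ranks = heapq.heappop(heap)
        flipped = frozenset(pos[r] for r in ranks)
        yield math.exp(base - total), likely ^ flipped
        nxt = last + 1
        if nxt < n:
            # Either flip one more decision, or trade the last flip for the
            # next more expensive one. Each subset is reached exactly once.
            heapq.heappush(heap, (total + cost[nxt], nxt, ranks + (nxt,)))
            heapq.heappush(
                heap,
                (total - cost[last] + cost[nxt], nxt, ranks[:-1] + (nxt,)),
            )
```

With made-up probabilities such as `[0.9, 0.2, 0.6]`, the generator first proposes repairing positions 0 and 2, then position 0 alone, and so on through all eight combinations. In a real search, each proposed set would be applied to the damaged stream and checked by decompressing it, stopping at the first plausible result.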

answered Sep 30 '22 16:09

Adam Davis

You could try writing a little script to replace all of the CRs with CRLFs (assuming the direction of trashing was CRLF to CR), swapping them randomly per block until you had the correct CRC. Assuming that the data wasn't particularly large, I guess that might not use all of your CPU until the heat death of the universe to complete.

As there is definite information loss, I don't know that there is a better way. Loss in the CR to CRLF direction might be slightly easier to roll back.
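A minimal sketch of that kind of script, for a gzip file, might look like the following. It differs from the description above in two ways that should be read as assumptions: it tries candidates systematically (fewest repairs first) rather than randomly, and it relies on the single CRC-32 that gzip stores for the whole member rather than a per-block check. The function name and the `max_repairs` limit are made up, and the running time grows combinatorially with the number of suspect bytes.

```python
import gzip
import itertools
import zlib


def recover_gzip(damaged: bytes, survivor: bytes = b"\r",
                 pair: bytes = b"\r\n", max_repairs: int = 4):
    """Try to undo a pair -> survivor collapse in a gzip stream.

    Every occurrence of `survivor` is a candidate. Candidates are expanded
    back to `pair` in every combination of up to `max_repairs` positions
    until gzip accepts the result (valid stream, matching CRC-32 and size).
    Returns the decompressed data, or None if nothing fits.
    """
    suspects = [i for i in range(len(damaged))
                if damaged[i:i + 1] == survivor]
    for count in range(max_repairs + 1):
        for chosen in itertools.combinations(suspects, count):
            candidate = bytearray()
            previous = 0
            for position in chosen:
                candidate += damaged[previous:position] + pair
                previous = position + 1
            candidate += damaged[previous:]
            try:
                return gzip.decompress(bytes(candidate))
            except (OSError, EOFError, zlib.error):
                continue  # wrong guess: bad stream, bad CRC, or truncated
    return None


# Small demonstration on an uncompressed (level 0) gzip member, so the
# CR LF pairs of the payload appear literally in the stream.
original = b"alpha\r\nbeta\r\ngamma\n"
clean = gzip.compress(original, compresslevel=0, mtime=0)
damaged = clean.replace(b"\r\n", b"\r")
print(recover_gzip(damaged) == original)
```

For the opposite direction of damage (CRLF collapsed to LF), the same function can be called with `survivor=b"\n"`. A matching CRC does not strictly prove the reconstruction is right, but a false match is very unlikely.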

answered Sep 30 '22 14:09

jsight
