
How to find the position of Central Directory in a Zip file?

Tags: format, zip

I am trying to find the position of the first Central Directory file header in a Zip file.

I'm reading these: http://en.wikipedia.org/wiki/Zip_(file_format) and http://www.pkware.com/documents/casestudies/APPNOTE.TXT

As I see it, I can only scan through the Zip data, identify by the header what kind of section I am at, and then do that until I hit the Central Directory header. I would obviously read the File Headers before that and use the "compressed size" to skip the actual data, and not for-loop through every byte in the file...

If I do it like that, then I practically already know all the files and folders inside the Zip file in which case I don't see much use for the Central Directory anymore.
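For illustration, the forward walk described above might look like the following minimal Python sketch. The function name `walk_local_headers` and its return shape are hypothetical, and the sketch assumes the whole archive is in memory, that entries sit back to back from offset 0, and that ZIP64 is not in use. It also shows one reason the approach is fragile: when bit 3 of the general-purpose flags is set, the sizes in the local header are zero and the real values follow the data in a data descriptor, so the data cannot be skipped this way.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"  # 0x04034b50 as stored on disk (little-endian)
# Fixed 30-byte part of a local file header.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def walk_local_headers(data: bytes):
    """Walk local file headers from offset 0.

    Returns (entries, stop_offset), where entries is a list of
    (name, data_offset, compressed_size) and stop_offset is the first
    offset that does not start with a local file header signature.
    """
    entries = []
    pos = 0
    while data[pos:pos + 4] == LOCAL_SIG:
        (_sig, _version, flags, _method, _time, _date, _crc,
         comp_size, _uncomp_size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        if flags & 0x0008:
            # Sizes are stored after the data in a data descriptor,
            # so "compressed size" here is zero and cannot be used to skip.
            raise ValueError("entry uses a data descriptor; size unknown here")
        name_start = pos + LOCAL_HEADER.size
        name = data[name_start:name_start + name_len].decode("cp437")
        data_offset = name_start + name_len + extra_len
        entries.append((name, data_offset, comp_size))
        pos = data_offset + comp_size  # jump over the compressed data
    return entries, pos
```

On an archive where every local header carries its sizes, the returned stop offset should be where the first Central Directory file header begins. On archives written to a non-seekable stream this typically fails at the first entry, which is part of why readers start from the end instead.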

To my understanding the purpose of Central Directory is to list file metadata, and the position of the actual data in the Zip file so you wouldn't need to scan the whole file?

After reading about End Of Central Directory record, Wikipedia says:

This ordering allows a zip file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.

How would I find the End of Central Directory record easily? We need to remember that it can end with an arbitrarily sized comment, so I may not know how many bytes from the end of the data stream it is located. Do I just scan for it?

P.S. I'm writing a Zip file reader.

Tower asked Dec 21 '11 17:12



3 Answers

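Start at the end and scan towards the beginning, looking for the End of Central Directory signature (0x06054b50, stored on disk as the bytes 50 4B 05 06) and counting the number of bytes you have scanned. When you find a candidate, read the 2-byte comment length (L) at byte offset 20 of the record. Check that L + 22 matches the number of bytes from the candidate signature to the end of the file, since the fixed part of the record is 22 bytes long. Then check that the start of the central directory (given by the 4-byte field at byte offset 16; offset 12 holds the directory's size) has an appropriate signature.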

If the candidate was a false hit (e.g. a match landing inside a data segment) and you assume the bits at the offset it points to are pretty random, the probability of the central directory signature also matching by chance is pretty low, about 1 in 2^32 for a 4-byte signature. You could refine this by working out the chance of landing in a data segment and the chance of hitting a legitimate header (as a function of the number of such headers), but this already sounds like a low likelihood to me. You could increase your confidence further by then checking the signature of the first file record listed, but be sure to handle the boundary case of an empty zip file.
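The checks above can be sketched in Python roughly as follows. This is a minimal illustration, not a full reader: `find_eocd` is a hypothetical name, the archive is assumed to fit in memory, and ZIP64 and multi-disk archives are ignored. It also assumes nothing follows the comment, so archives with trailing junk would be rejected.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # 0x06054b50 as stored on disk
CDFH_SIG = b"PK\x01\x02"   # Central Directory file header signature
EOCD_FIXED = 22            # record size without the comment
MAX_COMMENT = 0xFFFF       # comment length is a 16-bit field


def find_eocd(data: bytes):
    """Return (eocd_offset, cd_offset, total_entries), or None if not found."""
    lowest = max(0, len(data) - EOCD_FIXED - MAX_COMMENT)
    # Scan backwards over every position where the record could start.
    for pos in range(len(data) - EOCD_FIXED, lowest - 1, -1):
        if data[pos:pos + 4] != EOCD_SIG:
            continue
        # Offsets 10, 12, 16, 20: entry count, CD size, CD offset, comment length.
        total_entries, cd_size, cd_offset, comment_len = struct.unpack_from(
            "<HIIH", data, pos + 10)
        # The comment must run exactly to the end of the file.
        if pos + EOCD_FIXED + comment_len != len(data):
            continue
        # The central directory must end at or before this record.
        if cd_offset + cd_size > pos:
            continue
        if total_entries == 0:
            # Empty archive: there is no file header to check.
            if cd_size == 0:
                return pos, cd_offset, 0
            continue
        if data[cd_offset:cd_offset + 4] == CDFH_SIG:
            return pos, cd_offset, total_entries
    return None
```

The second element of the result is the position of the first Central Directory file header, which is what the question asks for. A stricter reader could additionally follow that header's local header offset and check for the `PK\x03\x04` signature there.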

Derek E answered Oct 07 '22 07:10



I ended up looping through the bytes starting from the end. The loop stops if it finds a matching byte sequence, if the index drops below zero, or if it has already gone through 64k bytes.
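A bounded scan of this kind could look like the sketch below, which reads only the tail of the file instead of the whole archive. The names `read_tail` and `last_eocd_candidate` are hypothetical. Note one assumption that differs slightly from a flat 64k limit: because the comment can be up to 65,535 bytes and the fixed record is 22 bytes, the sketch looks at the last 65,557 bytes. `bytes.rfind` stands in for the explicit byte loop, and the result is only a candidate that still needs validating.

```python
import os

EOCD_SIG = b"PK\x05\x06"      # 0x06054b50 as stored on disk
EOCD_FIXED = 22               # record size without the comment
TAIL_MAX = EOCD_FIXED + 0xFFFF  # farthest the signature can be from the end


def read_tail(f):
    """Return (start_offset, tail_bytes) for the last TAIL_MAX bytes of f."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    start = max(0, size - TAIL_MAX)
    f.seek(start)
    return start, f.read(size - start)


def last_eocd_candidate(f):
    """Return the absolute offset of the last possible EOCD signature, or None."""
    start, tail = read_tail(f)
    if len(tail) < EOCD_FIXED:
        return None
    # The signature cannot start later than EOCD_FIXED bytes before the end,
    # so the search window ends 4 bytes after that last valid start.
    idx = tail.rfind(EOCD_SIG, 0, len(tail) - EOCD_FIXED + 4)
    return None if idx < 0 else start + idx
```

Usage would be along the lines of `with open(path, "rb") as f: offset = last_eocd_candidate(f)`, where `path` is whatever archive is being read.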

Tower answered Oct 07 '22 06:10



Just cross your fingers and hope that there isn't an entry with the CRC, timestamp or datestamp as 06054B50, or any other sequence of four bytes that happens to be 06054B50.
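To make the concern concrete, here is a small Python sketch that plants the signature bytes inside the archive comment, which is the easiest place to force a collision. The helper `comment_length_matches` is a hypothetical name, and the archive is built with the standard `zipfile` module purely for the demonstration. It suggests that a bare search for the last occurrence can be fooled, while the comment-length check rejects the decoy.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # 0x06054b50 as stored on disk
EOCD_FIXED = 22           # record size without the comment


def comment_length_matches(data: bytes, pos: int) -> bool:
    """True if a record at pos would end exactly at the end of data."""
    if pos + EOCD_FIXED > len(data):
        return False
    (comment_len,) = struct.unpack_from("<H", data, pos + 20)
    return pos + EOCD_FIXED + comment_len == len(data)


# Build an archive whose comment contains a decoy signature.
comment = b"decoy " + EOCD_SIG + b" padding so the decoy has 22 bytes after it"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
    zf.comment = comment
data = buf.getvalue()

naive = data.rfind(EOCD_SIG)                  # lands on the decoy
real = len(data) - EOCD_FIXED - len(comment)  # where the record really is

assert naive != real
assert not comment_length_matches(data, naive)
assert comment_length_matches(data, real)
```

A deliberately crafted comment could still embed a complete fake record whose length field is consistent, so the further checks on the central directory offset and signature remain worthwhile.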

user2624417 answered Oct 07 '22 05:10
