I'm downloading zipped files containing XML files, and because of latency requirements I'd like to avoid writing the zip files to disk before manipulating them. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see the discussion below the EDIT for reasons why that is not reliable).
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through ZipInputStream, and I need a solution that will work with any valid ZIP file, as the penalty for a failure once I go into production will be high.
Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.
EDIT
The Apache Commons Zip library has a good write-up on some of the problems that using a stream (both their solution and Java's) has. I'll further add, from Wikipedia and personal experience, that the size and CRC fields in the entry headers may not be filled in (I have files with -1 in those fields). Thanks to centic for providing this link.
Also, let me quote Wikipedia on the subject:
Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures.
Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
Final Edit
If anyone is interested, this script can be used to produce, from an existing ZIP file, a valid ZIP file that cannot be read by ZipInputStream. So, as a final edit to this closed question: I needed a library that can read files such as the ones produced by this script.
EDIT: Another suggestion...
Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface of ZipFile, so why not go with that?
We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.
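As a rough sketch of that direction: recent versions of Commons Compress (1.13 and later) already ship an in-memory seekable channel, so depending on the version available to you, you may not need to fork anything at all. This is only a sketch, and it assumes the commons-compress dependency is available; everything below uses the published ZipFile / SeekableInMemoryByteChannel API:

import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

// Sketch: read a zip held entirely in a byte array via its central directory.
// Assumes Commons Compress 1.13+, which provides SeekableInMemoryByteChannel.
static void readEntries(byte[] data) throws IOException {
    try (ZipFile zip = new ZipFile(new SeekableInMemoryByteChannel(data))) {
        Enumeration<ZipArchiveEntry> entries = zip.getEntries();
        while (entries.hasMoreElements()) {
            ZipArchiveEntry entry = entries.nextElement();
            try (InputStream in = zip.getInputStream(entry)) {
                // parse the XML from 'in' here
            }
        }
    }
}

Since ZipFile reads the central directory rather than scanning for local headers, this sidesteps the entry-scanning behaviour you're worried about.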
EDIT: This may be a slightly more productive answer...
If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)
Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.
Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.
This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...
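If it helps to make the shape of that concrete, here's a rough sketch of the rewrite with one substitution: instead of parsing the central directory by hand, it leans on Commons Compress's ZipFile over an in-memory channel (assuming version 1.13+, as in the earlier sketch) to walk the directory, and re-emits each entry contiguously with java.util.zip.ZipOutputStream. The method name rewriteContiguously is made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

// Sketch: re-pack the archive as a plain, contiguous zip that ZipInputStream
// should be happy with. Entry data is decompressed and recompressed on the way.
static byte[] rewriteContiguously(byte[] original) throws IOException {
    ByteArrayOutputStream rewritten = new ByteArrayOutputStream();
    try (ZipFile source = new ZipFile(new SeekableInMemoryByteChannel(original));
         ZipOutputStream target = new ZipOutputStream(rewritten)) {
        Enumeration<ZipArchiveEntry> entries = source.getEntries();
        while (entries.hasMoreElements()) {
            ZipArchiveEntry entry = entries.nextElement();
            target.putNextEntry(new ZipEntry(entry.getName()));
            try (InputStream in = source.getInputStream(entry)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    target.write(buffer, 0, read);
                }
            }
            target.closeEntry();
        }
    }
    return rewritten.toByteArray();
}

Closing the ZipOutputStream writes a fresh central directory at the end, so the resulting bytes should be readable by any of the stream-based APIs.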
there's no way to say "here's a byte array of a zip file, use it"
Yes there is:
byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);
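Continuing that snippet, reading the entries might look something like this (the process call is only a placeholder for whatever you do with each XML):

ZipEntry entry;
while ((entry = zipStream.getNextEntry()) != null) {
    // zipStream yields this entry's uncompressed bytes until it reports end-of-stream
    byte[] contents = zipStream.readAllBytes(); // Java 9+; on older JDKs copy via a buffer loop
    process(entry.getName(), contents);         // hypothetical handler
    zipStream.closeEntry();
}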
That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't - so again, you could use a ByteArrayInputStream.
EDIT: It looks like ZipArchiveInputStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...
EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted wikipedia:
"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."
That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.
You then write:
I might further add that whether the zip is valid or not is not my concern. Working with it is.
... which suggests you want code which will handle invalid zip files. Combined with this:
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream
That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?
Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.
The TrueZIP library provides an alternative, mature zip implementation. It also features file system abstraction, even over HTTP.
For example:
Path path = new TPath(new URI("http://acme.com/download/everything.zip/entry.xml"));
try (InputStream in = Files.newInputStream(path)) {
    // Read archive entry contents here.
    ...
}
So, if you are interested only in specific entries, it would download only those, saving bandwidth and time. And you would not have to write the downloading code yourself.
See also http://truezip.java.net/faq.html#http.
I would use the Apache library commons-compress, see http://commons.apache.org/compress/
It has support for reading zip files via streams; there is in-depth documentation at http://commons.apache.org/compress/zip.html. It also describes some limitations which are inherent in the ZIP format.
Sample code looks as follows:
ZipArchiveInputStream zip =
    new ZipArchiveInputStream(inputStream);
try {
    ZipArchiveEntry entry = zip.getNextZipEntry();
    while (entry != null) {
        assertEquals("README", entry.getName());
        ...
        entry = zip.getNextZipEntry();
    }
} finally {
    zip.close();
}
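To connect that to the in-memory requirement from the question, the inputStream above can simply be a ByteArrayInputStream over the downloaded bytes. A sketch along those lines (handleXml is a hypothetical callback; IOUtils is the helper class shipped with commons-compress):

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;
import org.apache.commons.compress.utils.IOUtils;

// Sketch: stream a zip that is held entirely in memory.
static void readInMemory(byte[] data) throws IOException {
    try (ZipArchiveInputStream zip =
             new ZipArchiveInputStream(new ByteArrayInputStream(data))) {
        ZipArchiveEntry entry;
        while ((entry = zip.getNextZipEntry()) != null) {
            if (entry.getName().endsWith(".xml")) {
                // the current entry's uncompressed bytes are read from the stream itself
                byte[] xml = IOUtils.toByteArray(zip);
                handleXml(entry.getName(), xml); // hypothetical callback
            }
        }
    }
}

Keep in mind the limitation discussed above: this still reads the archive sequentially by local entry headers, not via the central directory.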
This question sounds similar to How to create a directory in memory? pseudo file system / virtual directory. Basically, my suggestion is to use a more general solution: an in-memory virtual filesystem (and I don't mean one at the OS level, like Linux's ramfs/tmpfs).
One example is to use the Java 7 NIO APIs, which now provide an SPI for implementing a file system via FileSystemProvider. It seems that the ShrinkWrap filesystem implements this SPI.
A more accessible option would be to use Apache Commons VFS' ram filesystem: it requires only Java 5. If you need to be compatible with Java 5 and 6, this is probably your best bet.
I first remember reading about in-memory filesystems in Java from this article, which apart from pointing out solutions like Commons VFS and JBoss Microcontainer, gives a nice example use case for the NetBeans IDE.
While an in-memory virtual filesystem is a nice general solution of avoiding the OS-level filesystem (with the associated performance benefits), it probably suffers from other disadvantages, which more specialized solutions could address. For instance, I am not sure how using this filesystem would behave when used concurrently from multiple threads. It might work fine as long as you don't access the same files, or you might need to create separate filesystems (which might be prohibitive in terms of resource usage).
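For what it's worth, a very rough sketch of the Commons VFS route might look like the following: park the downloaded bytes in the ram:// filesystem so that code expecting a (virtual) file never touches disk. The path name is arbitrary, and whether a layered zip:// view over ram:// stays fully in memory is something you'd want to verify before relying on it:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

// Sketch: stash the archive in Commons VFS's in-memory "ram" filesystem.
static FileObject stashInRam(byte[] data) throws IOException {
    FileSystemManager manager = VFS.getManager();
    FileObject ramFile = manager.resolveFile("ram://downloads/archive.zip");
    try (OutputStream out = ramFile.getContent().getOutputStream()) {
        out.write(data);
    }
    return ramFile; // ramFile.getContent().getInputStream() reads it back later
}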