
Java create tar archive with entries of unknown size

I have a web app where I need to be able to serve the user an archive of multiple files. I've set up a generic ArchiveExporter, and made a ZipArchiveExporter. Works beautifully! I can stream my data to my server, and archive the data and stream it to the user all without using much memory, and without needing a filesystem (I'm on Google App Engine).

Then I remembered the 4 GB limit on classic zip files (anything larger needs zip64). My archives can potentially get very large (high-res images), so I'd like an option to avoid zip files for my larger input.

I checked out org.apache.commons.compress.archivers.tar.TarArchiveOutputStream and thought I had found what I needed! Sadly, once I checked the docs and ran into some errors, I quickly found out that you MUST set the size of each entry before you stream it. This is a problem because the data is being streamed to me with no way of knowing the size beforehand.

I tried counting and returning the written bytes from export(), but TarArchiveOutputStream expects a size in TarArchiveEntry before writing to it, so that obviously doesn't work.

I can use a ByteArrayOutputStream and read each entry entirely before writing its content so I know its size, but my entries can potentially get very large, and this is not very polite to the other processes running on the instance.

I could use some form of persistence, upload the entry, and query the data size. However, that would be a waste of my Google Cloud Storage API calls, bandwidth, storage, and runtime.

I am aware of this SO question asking almost the same thing, but the asker settled for using zip files and there is no other relevant information.

What is the ideal solution to creating a tar archive with entries of unknown size?

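public abstract class ArchiveExporter<T extends OutputStream> extends Exporter { //base class
    public abstract void export(OutputStream out) throws IOException; //from Exporter interface; declares IOException because the overrides below throw it
    protected abstract void archiveItems(T t) throws IOException; //protected, since the overrides below are protected and cannot reduce visibility
}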

public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { //zip class, works as intended
    @Override
    public void export(OutputStream out) throws IOException {
        try(ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
            zos.setLevel(0);
            archiveItems(zos);
        }
    }
    @Override
    protected void archiveItems(ZipOutputStream zos) throws IOException {
        zos.putNextEntry(new ZipEntry(exporter.getFileName()));
        exporter.export(zos);
        //chained call to export from other exporter like json exporter for instance
        zos.closeEntry();
    }
}

public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
    @Override
    public void export(OutputStream out) throws IOException {
        try(TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
            archiveItems(taos);
        }
    }
    @Override
    protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
        TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
        //entry.setSize(?);
        taos.putArchiveEntry(entry);
        exporter.export(taos);
        taos.closeArchiveEntry();
    }
}

EDIT: This is what I was thinking with the ByteArrayOutputStream. It works, but I cannot guarantee I will always have enough memory to hold a whole entry at once, hence my streaming efforts. There has to be a more elegant way of streaming a tarball! Maybe this question is better suited for Code Review?

protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    try(ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        exporter.export(baos);
        byte[] data = baos.toByteArray();
        //holding ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
        int len = data.length;
        entry.setSize(len);
        taos.putArchiveEntry(entry);
        taos.write(data);
        taos.closeArchiveEntry();
    }
}

EDIT: This is what I meant by uploading the entry to a medium (Google Cloud Storage in this case) to accurately query the whole size. It seems like major overkill for what looks like a simple problem, but it doesn't suffer from the same RAM problems as the solution above; it just costs bandwidth and time. I hope someone smarter than me comes by and makes me feel stupid soon :D

protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    String name = NameHelper.getRandomName(); //get random name for temp storage
    BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); //prepare upload of temp file
    WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); //get WriteChannel for temp file
    try(OutputStream out = Channels.newOutputStream(wbc)) {
        exporter.export(out); //stream items to remote temp file
    } finally {
        wbc.close();
    }

    Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
    long size = blob.getSize(); //accurately query the size after upload
    entry.setSize(size);
    taos.putArchiveEntry(entry);

    ReadableByteChannel rbc = blob.reader(); //get ReadChannel for temp file
    try(InputStream in = Channels.newInputStream(rbc)) {
        IOUtils.copy(in, taos); //stream back to local tar stream from remote temp file 
    } finally {
        rbc.close();
    }
    blob.delete(); //delete remote temp file

    taos.closeArchiveEntry();
}
MeetTitan asked Oct 16 '22 10:10



1 Answer

I've been looking at a similar issue, and as far as I can tell this is a constraint of the tar file format.

Tar files are written as a stream, and metadata (filenames, permissions, etc.) is written between the file data (i.e. metadata 1, filedata 1, metadata 2, filedata 2, etc.). The program that extracts the data reads metadata 1, then starts extracting filedata 1, but it needs a way of knowing when that data ends. This could be done in a number of ways; tar does it by storing the length in the metadata.
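To make the constraint concrete, here is a minimal sketch of how a ustar header carries the length. The class and method names are made up for illustration; the offsets (512-byte header, size field at byte 124, 12 bytes of octal text) follow the ustar layout. It fills in only the name and size fields, so it is not a valid header on its own (no checksum, mode, etc.).

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical helper showing where tar keeps the entry length.
final class TarHeaderSketch {
    static final int HEADER_SIZE = 512; // each header occupies one 512-byte block
    static final int SIZE_OFFSET = 124; // the size field starts at byte 124
    static final int SIZE_LENGTH = 12;  // 11 octal digits plus a NUL terminator

    // Builds a header block with only the name and size fields populated.
    static byte[] headerFor(String name, long size) {
        byte[] header = new byte[HEADER_SIZE];
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(nameBytes, 0, header, 0, Math.min(nameBytes.length, 100));
        // The size is written as zero-padded octal text BEFORE any file data.
        // Eleven octal digits cover sizes up to 8 GiB - 1.
        byte[] sizeBytes = String.format("%011o", size).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(sizeBytes, 0, header, SIZE_OFFSET, sizeBytes.length);
        return header;
    }

    // What an extractor does: learn the length from the header, then read that many bytes.
    static long readSize(byte[] header) {
        String field = new String(header, SIZE_OFFSET, SIZE_LENGTH, StandardCharsets.US_ASCII);
        return Long.parseLong(field.trim(), 8); // trim() drops the trailing NUL
    }

    // File data is padded with zeros up to the next multiple of 512 bytes.
    static long paddedLength(long size) {
        return (size + 511) / 512 * 512;
    }

    public static void main(String[] args) {
        byte[] header = headerFor("image.png", 1234L);
        System.out.println(readSize(header));
        System.out.println(paddedLength(1234L));
    }
}
```

Since the header comes first in the stream, the writer has to commit to a size before it has seen the data, which is why TarArchiveOutputStream wants entry.setSize(...) before putArchiveEntry(...).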

Depending on your needs and what the recipient expects, there are a few options that I can see (not all apply to your situation):

  1. As you mentioned, load an entire file, work out the length, then send it.
  2. Divide the file into blocks of a predefined length (which fits into memory), then tar them up as file1-part1, file1-part2, etc.; the last block would be short.
  3. Divide the file into blocks of a predefined length (which don't need to fit into memory), then pad the last block to that size with something appropriate.
  4. Work out the maximum possible size of the file, and pad to that size.
  5. Use a different archive format.
  6. Make your own archive format, which does not have this limitation.
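Option 2 is probably the closest fit for a streaming exporter, so here is a rough sketch of it. Everything in it is hypothetical (ChunkingOutputStream, PartSink and the "-partN" naming are invented for this example). It uses only the JDK so it can run on its own: it buffers at most one chunk, and hands each chunk to a callback once its length is known.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical callback: receives one finished part whose length is now known.
interface PartSink {
    void accept(String partName, byte[] buffer, int length) throws IOException;
}

// Buffers at most chunkSize bytes, then emits them as "<base>-part1", "<base>-part2", ...
final class ChunkingOutputStream extends OutputStream {
    private final String baseName;
    private final byte[] buffer;
    private final PartSink sink;
    private int filled = 0;
    private int partNumber = 1;

    ChunkingOutputStream(String baseName, int chunkSize, PartSink sink) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.baseName = baseName;
        this.buffer = new byte[chunkSize];
        this.sink = sink;
    }

    @Override
    public void write(int b) throws IOException {
        buffer[filled++] = (byte) b;
        if (filled == buffer.length) emitPart();
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        while (len > 0) {
            int n = Math.min(len, buffer.length - filled);
            System.arraycopy(src, off, buffer, filled, n);
            filled += n;
            off += n;
            len -= n;
            if (filled == buffer.length) emitPart();
        }
    }

    // Emits the final, possibly short, part. The sink's own target is left open.
    @Override
    public void close() throws IOException {
        if (filled > 0) emitPart();
    }

    private void emitPart() throws IOException {
        sink.accept(baseName + "-part" + partNumber++, buffer, filled);
        filled = 0;
    }

    public static void main(String[] args) throws IOException {
        PartSink printer = (name, buf, len) -> System.out.println(name + " (" + len + " bytes)");
        try (ChunkingOutputStream out = new ChunkingOutputStream("file1", 4, printer)) {
            out.write("0123456789".getBytes(StandardCharsets.US_ASCII));
        }
    }
}
```

With Commons Compress, the sink would presumably create a TarArchiveEntry for partName, call setSize(length), then putArchiveEntry, write(buffer, 0, length) and closeArchiveEntry, and the exporter would write into the ChunkingOutputStream instead of the tar stream directly. The trade-off is that the recipient has to concatenate the parts in numeric order after extracting, and an empty input produces no entry at all.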

Interestingly, gzip does not need the length up front, and multiple gzips can be concatenated together, each with its own "original filename". Unfortunately, standard gunzip extracts all the resulting data into one file, using (I believe) the first filename.
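The concatenation behaviour is easy to check from Java. This is only a small sketch with made-up names. Note that java.util.zip.GZIPOutputStream does not write an "original filename" field, so the example shows only that two members written back to back are read as one continuous stream, not the filename part.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: two independent gzip members, concatenated, then decompressed.
final class ConcatenatedGzipSketch {
    // Compresses one payload as a complete, standalone gzip member.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Decompresses everything in the stream, across member boundaries.
    static String gunzipAll(byte[] data) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        }
    }

    // Joins two members byte for byte, with no length prefix between them.
    static String roundTrip(String first, String second) throws IOException {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        joined.write(gzip(first));
        joined.write(gzip(second));
        return gunzipAll(joined.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello ", "world"));
    }
}
```

The reader gets the data back, but the boundary between the two payloads is gone, which is the same limitation described above for gunzip.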

AMADANON Inc. answered Oct 21 '22 04:10
