
Delete files from a ZIP archive without Decompressing in Java or maybe Python

Tags:

java

python

zip

Delete files from a ZIP archive without decompressing using Java (Preferred) or Python

Hi,

I work with large ZIP files containing many hundreds of highly compressed text files. Decompressing one of these ZIP files can take a while and easily consume up to 20 GB of disk space. I would like to remove certain files from these ZIP files without having to decompress everything and then recompress only the files I want to keep.

It is of course possible to do this the long way, but that is very inefficient.

I would prefer to do this in Java, but will consider Python.

SeanDav asked Mar 09 '11 11:03


2 Answers

I found this on the web. It is a clean solution that uses only the standard library (the ZIP file system provider, available since Java 7), but I'm not sure whether it is included in the Android SDK; I have yet to find out.

import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class ZPFSDelete {
    public static void main(String[] args) throws Exception {

        /* Define ZIP file system properties in a HashMap */
        Map<String, String> zip_properties = new HashMap<>();
        /* We want to open an existing ZIP file, so we set "create" to false */
        zip_properties.put("create", "false");

        /* Specify the path to the ZIP file that you want to open as a file system */
        URI zip_disk = URI.create("jar:file:/my_zip_file.zip");

        /* Create the ZIP file system; closing it writes the change back to the archive */
        try (FileSystem zipfs = FileSystems.newFileSystem(zip_disk, zip_properties)) {
            /* Get the path inside the ZIP file of the entry to delete */
            Path pathInZipfile = zipfs.getPath("source.sql");
            System.out.println("About to delete an entry from ZIP file: " + pathInZipfile.toUri());
            /* Execute the delete */
            Files.delete(pathInZipfile);
            System.out.println("File successfully deleted");
        }
    }
}
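
Since the question is about removing several files at once, here is a minimal sketch of the same ZIP file system approach applied to every entry whose name ends with a given suffix. The class name ZipFsBulkDelete, the method deleteBySuffix, and the suffix-based selection are assumptions made for illustration; they are not part of the snippet above.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ZipFsBulkDelete {

    /** Deletes every file entry whose name ends with suffix; returns the deleted entry paths. */
    static List<String> deleteBySuffix(Path zipFile, String suffix) throws IOException {
        // Build a "jar:file:..." URI for the archive, as in the example above.
        URI uri = URI.create("jar:" + zipFile.toUri());
        List<String> deleted = new ArrayList<>();
        try (FileSystem zipfs = FileSystems.newFileSystem(uri, Collections.singletonMap("create", "false"))) {
            // Collect the matching entries first, then delete them once the walk is finished.
            List<Path> victims;
            try (Stream<Path> walk = Files.walk(zipfs.getPath("/"))) {
                victims = walk.filter(Files::isRegularFile)
                              .filter(p -> p.getFileName().toString().endsWith(suffix))
                              .collect(Collectors.toList());
            }
            for (Path p : victims) {
                Files.delete(p);
                deleted.add(p.toString());
            }
        } // closing the file system writes the changes back to the archive
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ZipFsBulkDelete my_zip_file.zip .sql
        System.out.println(deleteBySuffix(Paths.get(args[0]), args[1]));
    }
}
```

Collecting the paths before deleting keeps the directory walk separate from the modifications; all deletions are then written out together when the file system is closed.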
Valen answered Oct 26 '22 17:10


I don't have code to do this, but the basic idea is simple and should translate into almost any language the same way. The ZIP file layout is just a series of blocks that represent files (a header followed by the compressed data), finished off with a central directory that just contains all the metadata. Here's the process:

  1. Scan forward in the file until you find the first file you want to delete.
  2. Scan forward in the file until you find the first file you don't want to delete or you hit the central directory.
  3. Scan forward in the file until you find the first file you want to delete or you hit the central directory.
  4. Copy all the data you found in step 3 back onto the data you skipped in step 2 until you find another file you want to delete or you hit the central directory.
  5. Go to step 2 unless you've hit the central directory.
  6. Copy the central directory to wherever you left off copying, leaving out the entries for the deleted files and changing the offsets to reflect how much you moved each file.

See http://en.wikipedia.org/wiki/ZIP_%28file_format%29 for all the details on the ZIP file structures.

As bestsss suggests, you might want to perform the copying into another file, so as to prevent losing data in the event of a failure.
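
To make the idea concrete, here is a rough Java sketch of the copy-into-another-file variant. It is an illustration under stated assumptions, not a tested tool: it assumes a single-disk archive without ZIP64 extensions, entry names that decode as UTF-8, and local entries stored back to back, so that each entry's bytes run from its local header to the next local header or to the central directory. Instead of scanning forward through the local headers, it reads the offsets from the central directory. The names ZipRawDelete, copyWithout, read and writeFully are made up for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ZipRawDelete {

    /** One central directory record: where it sits in the directory and where its data lives. */
    static final class Entry { String name; int cdPos; int cdLen; long offset; long end; }

    static void copyWithout(Path src, Path dst, Set<String> toDelete) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
            // Find the end-of-central-directory record by scanning the file tail backwards.
            int tailLen = (int) Math.min(in.size(), 22 + 0xFFFF);
            ByteBuffer tail = read(in, in.size() - tailLen, tailLen);
            int eocd = tailLen - 22;
            while (eocd >= 0 && tail.getInt(eocd) != 0x06054b50) eocd--;
            if (eocd < 0) throw new IOException("End of central directory not found");
            int count = tail.getShort(eocd + 10) & 0xFFFF;
            long cdSize = tail.getInt(eocd + 12) & 0xFFFFFFFFL;
            long cdOffset = tail.getInt(eocd + 16) & 0xFFFFFFFFL;
            if (count == 0xFFFF || cdOffset == 0xFFFFFFFFL)
                throw new IOException("ZIP64 archives are not handled by this sketch");

            // Parse the central directory: entry names and local header offsets.
            ByteBuffer cd = read(in, cdOffset, (int) cdSize);
            List<Entry> entries = new ArrayList<>();
            for (int i = 0, pos = 0; i < count; i++) {
                if (cd.getInt(pos) != 0x02014b50) throw new IOException("Bad central directory record");
                Entry e = new Entry();
                int nameLen = cd.getShort(pos + 28) & 0xFFFF;
                e.cdPos = pos;
                e.cdLen = 46 + nameLen + (cd.getShort(pos + 30) & 0xFFFF) + (cd.getShort(pos + 32) & 0xFFFF);
                e.name = new String(cd.array(), pos + 46, nameLen, StandardCharsets.UTF_8);
                e.offset = cd.getInt(pos + 42) & 0xFFFFFFFFL;
                entries.add(e);
                pos += e.cdLen;
            }

            // Assumption: an entry's bytes end where the next local header (or the directory) starts.
            List<Entry> byOffset = new ArrayList<>(entries);
            byOffset.sort(Comparator.comparingLong(e -> e.offset));
            for (int i = 0; i < byOffset.size(); i++)
                byOffset.get(i).end = (i + 1 < byOffset.size()) ? byOffset.get(i + 1).offset : cdOffset;

            // Copy kept entries verbatim (still compressed) and patch their offsets in the directory.
            for (Entry e : byOffset) {
                if (toDelete.contains(e.name)) continue;
                cd.putInt(e.cdPos + 42, (int) out.position());
                for (long p = e.offset; p < e.end; ) p += in.transferTo(p, e.end - p, out);
            }

            // Write the surviving directory records, then an updated end-of-directory record.
            long newCdOffset = out.position();
            int kept = 0;
            for (Entry e : entries) {
                if (toDelete.contains(e.name)) continue;
                writeFully(out, ByteBuffer.wrap(cd.array(), e.cdPos, e.cdLen));
                kept++;
            }
            long newCdSize = out.position() - newCdOffset;
            tail.putShort(eocd + 8, (short) kept).putShort(eocd + 10, (short) kept)
                .putInt(eocd + 12, (int) newCdSize).putInt(eocd + 16, (int) newCdOffset);
            writeFully(out, ByteBuffer.wrap(tail.array(), eocd, tailLen - eocd));
        }
    }

    /** Reads exactly len bytes starting at pos into a little-endian buffer. */
    static ByteBuffer read(FileChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining())
            if (ch.read(buf, pos + buf.position()) < 0) throw new IOException("Unexpected end of file");
        return buf;
    }

    static void writeFully(FileChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) ch.write(buf);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ZipRawDelete in.zip out.zip name1 name2 ...
        copyWithout(Paths.get(args[0]), Paths.get(args[1]),
                new HashSet<>(Arrays.asList(args).subList(2, args.length)));
    }
}
```

Because the kept entries are copied byte for byte, nothing is decompressed or recompressed; only the central directory offsets and the counts in the end-of-directory record change. Archives that rely on ZIP64 (more than 65,535 entries or sizes beyond 4 GB) would need extra handling that this sketch leaves out.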

Gabe answered Oct 26 '22 18:10