
Overwriting previously extracted files instead of creating new ones

Several Python libraries can extract archive files, such as gzip, zipfile, rarfile, tarfile and patool. I found patool especially useful because it is cross-format: it can extract almost any type of archive, including the most popular ones such as ZIP, GZIP, TAR and RAR.

Extracting an archive file with patool is as easy as this:

import patoolib
patoolib.extract_archive("Archive.zip", outdir="Folder1")

Here "Archive.zip" is the path of the archive file and "Folder1" is the path of the directory where the extracted files will be stored.

The extraction works fine. The problem is that if I run the same code again for the exact same archive file, an identical extracted file is stored in the same folder under a slightly different name (filename at the first run, filename1 at the second, filename11 at the third and so on).

Instead, I need the code to overwrite the extracted file if a file with the same name already exists in the directory.

The extract_archive function looks minimal: it only has these two parameters, a verbosity parameter, and a program parameter that specifies the program to extract archives with.

Edit: Nizam Mohamed's answer documented that the extract_archive function actually overwrites the output. I found that to be only partially true: the function overwrites files extracted from ZIP archives, but not from GZ archives, which is what I am after. For GZ files, the function still generates new files.

Edit: Padraic Cunningham's answer suggested using the master source. So I downloaded that code and replaced my old patool library scripts with the scripts from that source. Here is the result:

os.listdir()
Out[11]: ['a.gz']

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[12]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[13]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[14]: '.'

os.listdir()
Out[15]: ['a', 'a.gz', 'a1', 'a2']

So, again, the extract_archive function creates a new file every time it is executed. Note that the file stored inside a.gz actually has a different name from a.

asked Apr 14 '15 15:04 by multigoodverse




1 Answer

As you've stated, patoolib is intended to be a generic archive tool.

Various archive types can be created, extracted, tested, listed, compared, searched and repacked with patool. The advantage of patool is its simplicity in handling archive files without having to remember a myriad of programs and options.

Generic Extract Behaviour vs Specific Extract Behaviour

The problem here is that extract_archive exposes almost no way to change the default behaviour of the underlying archive tool.

For a .zip extension, patoolib will use unzip. You can get the desired overwrite behaviour by passing -o on the command line, i.e. unzip -o ... However, this option is specific to unzip, and the equivalent differs for each archive utility.

For example, tar offers an overwrite option, but no short form like unzip's: tar --overwrite works, but tar -o does not have the intended effect.

To fix this issue you could make a feature request to the author, or use an alternative library. Unfortunately, patoolib's generic design means that every extract utility function would have to be extended to pass its underlying extractor's own overwrite option.
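For the single-file .gz case from the question, one alternative is the standard library's gzip module. The following is only a minimal sketch: the helper name gunzip_overwrite is made up for illustration, and it assumes the output should be named after the archive with the .gz suffix removed, which may not match the name stored inside the archive.

```python
import gzip
import os
import shutil


def gunzip_overwrite(archive, outdir="."):
    """Decompress one .gz file into outdir, replacing any existing output.

    Hypothetical helper, not part of patoolib.
    """
    base = os.path.basename(archive)
    # Assumption: name the output after the archive minus its ".gz" suffix.
    # The ".out" fallback avoids overwriting an archive that lacks the suffix.
    name = base[:-3] if base.lower().endswith(".gz") else base + ".out"
    os.makedirs(outdir, exist_ok=True)
    target = os.path.join(outdir, name)
    # Opening the target in "wb" mode truncates an existing file, so a
    # repeated run overwrites it instead of creating a new name.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling gunzip_overwrite("a.gz", ".") repeatedly should leave a single file a next to a.gz.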

Example Changes to patoolib

In patoolib.programs.unzip

def extract_zip (archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a ZIP archive."""
    cmdlist = [cmd]
    if verbosity > 1:
        cmdlist.append('-v')
    if overwrite:
        cmdlist.append('-o')
    cmdlist.extend(['--', archive, '-d', outdir])
    return cmdlist

In patoolib.programs.tar

def extract_tar (archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a TAR archive."""
    cmdlist = [cmd, '--extract']
    if overwrite:
        cmdlist.append('--overwrite')
    add_tar_opts(cmdlist, compression, verbosity)
    cmdlist.extend(["--file", archive, '--directory', outdir])
    return cmdlist

It's not a trivial change to update every program: each program is different!
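To illustrate why, here is a rough sketch of the per-program lookup such a change would need. The table and the function name overwrite_option are hypothetical; only the two flags already mentioned above are listed, and every other extractor would need its own entry.

```python
# Overwrite flags taken from the two examples above. Other extractors are
# deliberately left out because their options are not covered here.
OVERWRITE_FLAGS = {
    "unzip": "-o",
    "tar": "--overwrite",
}


def overwrite_option(program, overwrite=True):
    """Return the extra command-line arguments that enable overwriting.

    Hypothetical helper, not part of patoolib.
    """
    if not overwrite:
        return []
    try:
        return [OVERWRITE_FLAGS[program]]
    except KeyError:
        # Fail loudly rather than guess a flag for an unknown program.
        raise NotImplementedError(
            f"no known overwrite flag for {program!r}"
        ) from None
```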

Monkey patching overwrite behavior

So you've decided not to improve the patoolib source code... We can override the behaviour of extract_archive so that it first looks for an existing output directory, removes it, and then calls the original extract_archive.

You could include this code in your modules; if many modules require it, perhaps put it in __init__.py.

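import os
import shutil

import patoolib

# Keep a reference to the original function; calling patoolib.extract_archive
# inside the wrapper after patching would recurse forever.
_extract_archive = patoolib.extract_archive


def overwrite_then_extract_archive(archive, verbosity=0, outdir=None, program=None):
    # Caution: this deletes the whole output directory, so never pass a
    # directory (such as ".") that holds anything you want to keep.
    if outdir and os.path.exists(outdir):
        shutil.rmtree(outdir)
    return _extract_archive(archive, verbosity=verbosity, outdir=outdir, program=program)

patoolib.extract_archive = overwrite_then_extract_archive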

Now when we call extract_archive() we have the functionality of overwrite_then_extract_archive().
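Because removing the whole output directory is destructive (with outdir="." it would also delete the archive itself), a gentler variant is to extract into a fresh staging directory and then move each item over the old one. This is only a sketch under assumptions: extract_overwriting is a made-up name, and the extract argument is any callable accepting (archive, outdir=...), which patoolib.extract_archive does.

```python
import os
import shutil
import tempfile


def extract_overwriting(archive, outdir, extract):
    """Extract via `extract`, replacing same-named items already in outdir.

    Hypothetical helper; `extract` is called as extract(archive, outdir=path).
    """
    os.makedirs(outdir, exist_ok=True)
    # Stage inside outdir so the final moves stay on the same filesystem.
    with tempfile.TemporaryDirectory(dir=outdir) as staging:
        extract(archive, outdir=staging)
        for name in os.listdir(staging):
            src = os.path.join(staging, name)
            dst = os.path.join(outdir, name)
            # Replace only the items the archive actually produced.
            # An existing directory of the same name is removed, not merged.
            if os.path.isdir(dst) and not os.path.islink(dst):
                shutil.rmtree(dst)
            elif os.path.lexists(dst):
                os.remove(dst)
            shutil.move(src, dst)
    return outdir
```

For example, extract_overwriting("a.gz", ".", patoolib.extract_archive) extracts into an empty staging directory, so the extractor has no reason to pick a new name such as a1, and the result then replaces the existing a.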

answered Sep 18 '22 11:09 by Matt Davidson