Overwriting previously extracted files instead of creating new ones

Several Python libraries can extract archive files, such as gzip, zipfile, rarfile, tarfile and patool. I found patool especially useful because it works across formats: it can extract almost any type of archive, including the most popular ones such as ZIP, GZIP, TAR and RAR.

Extracting an archive file with patool is as easy as this:

import patoolib
patoolib.extract_archive("Archive.zip", outdir="Folder1")

Here "Archive.zip" is the path of the archive file and "Folder1" is the path of the directory where the extracted file will be stored.

The extraction works fine. The problem is that if I run the same code again for the exact same archive file, an identical extracted file is stored in the same folder but with a slightly different name (filename on the first run, filename1 on the second, filename11 on the third, and so on).

Instead, I need the code to overwrite the extracted file if a file with the same name already exists in the directory.

The extract_archive function looks minimal: besides these two parameters it only has a verbosity parameter and a program parameter, which specifies the program you want to extract archives with.

Edit: Nizam Mohamed's answer pointed out that the extract_archive function actually overwrites the output. I found that to be only partially true: the function overwrites files extracted from ZIP archives, but not from GZ archives, which is what I am after. For GZ files, the function still generates new files.

Edit 2: Padraic Cunningham's answer suggested using the master source. So I downloaded that code and replaced my old patool library scripts with the scripts from the link. Here is the result:

os.listdir()
Out[11]: ['a.gz']

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[12]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[13]: '.'

patoolib.extract_archive("a.gz",verbosity=1,outdir=".")
patool: Extracting a.gz ...
patool: ... a.gz extracted to `.'.
Out[14]: '.'

os.listdir()
Out[15]: ['a', 'a.gz', 'a1', 'a2']

So, again, the extract_archive function creates a new file every time it is executed. Note that the file stored inside a.gz actually has a different name from a.

asked Apr 14 '15 15:04

multigoodverse

1 Answer

As you've stated, patoolib is intended to be a generic archive tool.

Various archive types can be created, extracted, tested, listed, compared, searched and repacked with patool. The advantage of patool is its simplicity in handling archive files without having to remember a myriad of programs and options.

Generic Extract Behaviour vs Specific Extract Behaviour

The problem here is that extract_archive gives you very little control over the default behaviour of the underlying archive tool.

For a .zip extension, patoolib will use unzip. You can get the desired overwriting behaviour by passing -o on the command line, i.e. unzip -o ... However, -o is specific to unzip, and the option differs for each archive utility.

For example, tar offers an overwrite option but no short form equivalent to the one unzip has: tar --overwrite works, but tar -o does not have the intended effect (in GNU tar, -o means something else).

To fix this issue you could make a feature request to the author, or use an alternative library. Unfortunately, patoolib's generic design means that every extract utility function would have to be extended to pass its underlying extractor's own overwrite option.
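Since the question is specifically about GZ files, one alternative is the standard library's gzip module. The following is a minimal sketch rather than a drop-in replacement for patool: gunzip_overwrite is a hypothetical helper name, and it assumes the output name can be derived by stripping the .gz suffix (it ignores any original file name stored inside the archive).

```python
import gzip
import os
import shutil


def gunzip_overwrite(archive, outdir="."):
    """Decompress a .gz file into outdir, replacing any existing output."""
    # Assumption: the output name is the archive name without ".gz",
    # e.g. "a.gz" -> "a". Other suffixes get ".out" appended instead.
    name = os.path.basename(archive)
    if name.endswith(".gz"):
        name = name[:-3]
    else:
        name += ".out"
    os.makedirs(outdir, exist_ok=True)
    target = os.path.join(outdir, name)
    # Opening the target in "wb" mode truncates an existing file,
    # so running this again overwrites instead of creating "a1", "a2", ...
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling gunzip_overwrite("a.gz", outdir="Folder1") repeatedly should leave a single file Folder1/a.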

Example Changes to patoolib

In patoolib.programs.unzip

def extract_zip(archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a ZIP archive."""
    cmdlist = [cmd]
    if verbosity > 1:
        cmdlist.append('-v')
    if overwrite:
        # unzip: -o overwrites existing files without prompting
        cmdlist.append('-o')
    cmdlist.extend(['--', archive, '-d', outdir])
    return cmdlist

In patoolib.programs.tar

def extract_tar(archive, compression, cmd, verbosity, outdir, overwrite=False):
    """Extract a TAR archive."""
    cmdlist = [cmd, '--extract']
    if overwrite:
        # tar has no short form for this; the long option is required
        cmdlist.append('--overwrite')
    add_tar_opts(cmdlist, compression, verbosity)
    cmdlist.extend(["--file", archive, '--directory', outdir])
    return cmdlist

Updating every program is not a trivial change, because each program is different!
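To illustrate why the option has to be tracked per program, here is a rough sketch of a lookup table. OVERWRITE_FLAGS and with_overwrite are hypothetical names that are not part of patool; the flags themselves are the ones discussed above for unzip and tar, plus gzip's --force, which is an assumption about what one would add for GZ files.

```python
import os

# Hypothetical lookup table, not part of patool: the option that makes
# each command-line extractor overwrite existing files.
OVERWRITE_FLAGS = {
    "unzip": ["-o"],
    "tar": ["--overwrite"],
    "gzip": ["--force"],  # applies when gzip writes the output file itself
}


def with_overwrite(cmdlist):
    """Return a copy of cmdlist with the program's overwrite flag added."""
    # The program may be given as a full path, so look it up by base name.
    program = os.path.basename(cmdlist[0])
    if program not in OVERWRITE_FLAGS:
        raise ValueError(f"no known overwrite flag for {program!r}")
    # Insert the flag right after the program, before the other arguments.
    return [cmdlist[0]] + OVERWRITE_FLAGS[program] + cmdlist[1:]


print(with_overwrite(["unzip", "--", "Archive.zip", "-d", "Folder1"]))
# prints ['unzip', '-o', '--', 'Archive.zip', '-d', 'Folder1']
```

Any extractor missing from the table raises an error instead of silently extracting without the flag, which makes the per-program maintenance cost visible.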

Monkey patching overwrite behavior

So you've decided not to improve the patoolib source code... We can override the behaviour of extract_archive so that it first looks for an existing output directory, removes it, and then calls the original extract_archive.

You could include this code in your modules; if many modules require it, perhaps put it in __init__.py.

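import os
import shutil
import patoolib

# Keep a reference to the original function; without it the patched
# function would call itself and recurse forever.
_original_extract_archive = patoolib.extract_archive


def overwrite_then_extract_archive(archive, verbosity=0, outdir=None, program=None):
    # Warning: this deletes the whole output directory, so never point
    # outdir at a directory holding the archive or other files you need.
    if outdir and os.path.isdir(outdir):
        shutil.rmtree(outdir)
        os.makedirs(outdir)
    return _original_extract_archive(
        archive, verbosity=verbosity, outdir=outdir, program=program)

patoolib.extract_archive = overwrite_then_extract_archive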

Now when we call extract_archive() we have the functionality of overwrite_then_extract_archive().
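If deleting the whole output directory is too drastic (for example with outdir="." as in the question), a gentler variant is to extract into an empty temporary directory and then copy the results over the existing files. This is only a sketch under some assumptions: extract_overwriting is a hypothetical helper, the extract parameter exists only so another extractor can be substituted, and shutil.copytree with dirs_exist_ok requires Python 3.8 or later.

```python
import shutil
import tempfile


def extract_overwriting(archive, outdir, extract=None):
    """Extract archive into outdir, replacing files that already exist."""
    if extract is None:
        # Imported lazily so the helper can be tried without patool.
        import patoolib
        extract = patoolib.extract_archive
    with tempfile.TemporaryDirectory() as staging:
        # The staging directory starts empty, so the extractor should
        # have no name clash to avoid and no reason to produce "a1".
        extract(archive, outdir=staging)
        # Merge into outdir; files with the same name are overwritten,
        # unrelated files already in outdir are left alone.
        shutil.copytree(staging, outdir, dirs_exist_ok=True)
    return outdir
```

With this approach extract_overwriting("a.gz", ".") would leave a.gz and any other files in the current directory untouched.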

answered Sep 18 '22 11:09

Matt Davidson

Tags: python, overwrite, file, extract, ziparchive