
How to append a file to a tar file using Python's tarfile module?

Tags:

python

tarfile

I want to append a file to a tar file. For example, test.tar.gz contains a.png, b.png, and c.png. I have a new PNG file that is also named a.png, and I want to add it to test.tar.gz so that it replaces the old a.png. My code:

import tarfile
a = tarfile.open('test.tar.gz', 'w:gz')
a.add('a.png')
a.close()

After that, every file in test.tar.gz has disappeared except a.png. If I change my code to this:

import tarfile
a = tarfile.open('test.tar.gz', 'a:')# or a:gz
a.add('a.png')
a.close()

the program crashes with this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/tarfile.py", line 1678, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python2.7/tarfile.py", line 1588, in __init__
    raise ReadError(str(e))
tarfile.ReadError: invalid header

What are my mistakes and what should I do?

Update: According to the documentation, gz files cannot be opened in 'a' mode. If so, what is the best way to add or update files in an existing archive?

Karl Doenitz asked Feb 06 '15 08:02



2 Answers

From tarfile documentation:

Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.

So I guess you should decompress the archive with the gzip library, add the files to the resulting tar file using tarfile's 'a:' mode, and then compress it again with gzip.
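The three steps suggested above could look roughly like this. It is only a sketch: the function name append_to_targz and the use of a temporary .tar file on disk are assumptions for illustration, not part of the original answer. Note that 'a' mode appends a new member; an existing member with the same name stays in the archive.

```python
import gzip
import os
import shutil
import tarfile
import tempfile


def append_to_targz(targz_path, src_path, arcname=None):
    """Append src_path to a .tar.gz by way of a temporary uncompressed .tar.

    Hypothetical helper for illustration only.
    """
    fd, tmp_tar = tempfile.mkstemp(suffix=".tar")
    os.close(fd)
    try:
        # 1. decompress the .tar.gz into a temporary plain .tar
        with gzip.open(targz_path, "rb") as gz, open(tmp_tar, "wb") as out:
            shutil.copyfileobj(gz, out)
        # 2. plain 'a' mode works on an uncompressed tar
        with tarfile.open(tmp_tar, "a") as tar:
            tar.add(src_path, arcname=arcname or os.path.basename(src_path))
        # 3. compress again, overwriting the original archive
        with open(tmp_tar, "rb") as src, gzip.open(targz_path, "wb") as gz:
            shutil.copyfileobj(src, gz)
    finally:
        os.remove(tmp_tar)
```

With the files from the question, the call would be append_to_targz("test.tar.gz", "a.png").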

Igor Hatarist answered Oct 23 '22 04:10


David Dale asks:

Update: According to the documentation, gz files cannot be opened in 'a' mode. If so, what is the best way to add or update files in an existing archive?

Short answer:

  1. decompress / unpack archive
  2. replace / add file(s)
  3. repack / compress archive

I tried to do it in memory using the file/stream interfaces of gzip and tarfile but did not manage to get it running. The tarball has to be rewritten anyway, since replacing a file in place is apparently not possible, so it's simpler to just unpack the whole archive.
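Since the tarball has to be rewritten anyway, one alternative to unpacking everything to disk is to copy the members one by one into a new archive and skip the one being replaced. The following is a minimal sketch under that assumption; replace_member is a hypothetical name, and the temporary archive is created next to the original so that os.replace can swap it in.

```python
import os
import tarfile
import tempfile


def replace_member(targz_path, src_path, arcname):
    """Rewrite targz_path, swapping the member called arcname for src_path.

    Hypothetical helper for illustration only.
    """
    dirname = os.path.dirname(os.path.abspath(targz_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tar.gz", dir=dirname)
    os.close(fd)
    try:
        with tarfile.open(targz_path, "r:gz") as old, \
                tarfile.open(tmp_path, "w:gz") as new:
            for member in old:
                if member.name == arcname:
                    continue  # drop the outdated copy
                # only regular files carry data; directories and links do not
                data = old.extractfile(member) if member.isreg() else None
                new.addfile(member, data)
            new.add(src_path, arcname=arcname)
        os.replace(tmp_path, targz_path)  # swap in the rewritten archive
    except BaseException:
        os.remove(tmp_path)
        raise
```

For the example in the question this would be called as replace_member("test.tar.gz", "new.png", "a.png"). The replaced file ends up as the last member of the archive.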

Wikipedia on tar, gzip.

The script, if run directly, also tries to generate the test images "a.png", "b.png", "c.png", and "new.png" (requiring Pillow) and the initial archive "test.tar.gz" if they don't exist. It then decompresses the archive into a temporary directory, overwrites "a.png" with the contents of "new.png", and packs all files, overwriting the original archive. The individual files are:

a.png, b.png, c.png, new.png

Of course the script's functions can also be run sequentially in interactive mode, in order to have a chance to look at the files. Assuming the script's filename is "t.py":

>>> from t import *
>>> make_images()
>>> make_archive()
>>> replace_file()
Workaround

Here we go (the essential part is in replace_file()):

#!python3
#coding=utf-8
"""
Replace a file in a .tar.gz archive via temporary files
   https://stackoverflow.com/questions/28361665/how-to-append-a-file-to-a-tar-file-use-python-tarfile-module
"""

import sys        #
import pathlib    # https://docs.python.org/3/library/pathlib.html
import tempfile   # https://docs.python.org/3/library/tempfile.html
import tarfile    # https://docs.python.org/3/library/tarfile.html
#import gzip      # https://docs.python.org/3/library/gzip.html

gfn = "test.tar.gz"
iext = ".png"

replace = "a"+iext
replacement = "new"+iext

def make_images():
    """Generate 4 test images with Pillow (PIL fork, http://pillow.readthedocs.io/)"""
    try:
        from PIL import Image, ImageDraw, ImageFont
        font = ImageFont.truetype("arial.ttf", 50)

        for k,v in {"a":"red", "b":"green", "c":"blue", "new":"orange"}.items():
            img = Image.new('RGB', (100, 100), color=v)
            d = ImageDraw.Draw(img)
            d.text((0, 0), k, fill=(0, 0, 0), font=font)
            img.save(k+iext)
    except Exception as e:
        print(e, file=sys.stderr)
        print("Could not create image files", file=sys.stderr)
        print("(pip install pillow)", file=sys.stderr)

def make_archive():
    """Create gzip compressed tar file with the three images"""
    try:
        t = tarfile.open(gfn, 'w:gz')
        for f in 'abc':
            t.add(f+iext)
        t.close()
    except Exception as e:
        print(e, file=sys.stderr)
        print("Could not create archive", file=sys.stderr)

def make_files():
    """Generate sample images and archive"""
    mi = False
    for f in ['a','b','c','new']:
        p = pathlib.Path(f+iext)
        if not p.is_file():
            mi = True
    if mi:
        make_images()
    if not pathlib.Path(gfn).is_file():
        make_archive()

def add_file_not():
    """Might even corrupt the existing file?"""
    print("Not possible: tarfile with \"a:gz\" - failing now:", file=sys.stderr)
    try:
        a = tarfile.open(gfn, 'a:gz')  # not possible!
        a.add(replacement, arcname=replace)
        a.close()
    except Exception as e:
        print(e, file=sys.stderr)

def replace_file():
    """Extract archive to temporary directory, replace file, replace archive """
    print("Workaround", file=sys.stderr)

    # tempdir
    with tempfile.TemporaryDirectory() as td:
        # dirname to Path
        tdp = pathlib.Path(td)

        # extract archive to temporary directory
        with tarfile.open(gfn) as r:
            r.extractall(td)

        # print(list(tdp.iterdir()), file=sys.stderr)

        # replace target in temporary directory
        (tdp/replace).write_bytes( pathlib.Path(replacement).read_bytes() )

        # replace archive, from all files in tempdir
        with tarfile.open(gfn, "w:gz") as w:
            for f in tdp.iterdir():
                w.add(f, arcname=f.name)
    #done

def test():
    """as the name suggests, this just runs some tests ;-)"""
    make_files()
    #add_file_not()
    replace_file()

if __name__ == "__main__":
    test()

If you want to add files instead of replacing them, obviously just omit the line that replaces the temporary file, and copy the additional files into the temp directory. Make sure that pathlib.Path.iterdir then also "sees" the new files to be added to the new archive.


I've put this in a somewhat more useful function:

def targz_add(targz=None, src=None, dst=None, replace=False):
    """Add <src> file(s) to <targz> file, optionally replacing existing file(s).
    Uses temporary directory to modify archive contents.
    TODO: complete error handling...
    """
    import sys, pathlib, tempfile, tarfile

    # ensure targz exists
    tp = pathlib.Path(targz)
    if not tp.is_file():
        sys.stderr.write("Target '{}' does not exist!\n".format(tp) )
        return 1

    # src path(s)
    if not src:
        sys.stderr.write("No files given.\n")
        return 1
    # ensure iterable of string(s)
    if not isinstance(src, (tuple, list, set)):
        src = [src]
    # ensure path(s) exist
    srcp = []
    for s in src:
        sp = pathlib.Path(s)
        if not sp.is_file():
            sys.stderr.write("Source '{}' does not exist.\n".format(sp) )
        else:
            srcp.append(sp)

    if not srcp:
        sys.stderr.write("None of the files exist.\n")
        return 1

    # dst path(s) (filenames in archive)
    dstp = []
    if not dst:
        # default: use filename only
        dstp = [sp.name for sp in srcp]
    else:
        if callable(dst):
            # map dst to each Path, ensure results are Path
            dstp = [pathlib.Path(c) for c in map(dst, srcp)]
        elif not isinstance(dst, (tuple, list, set)):
            # ensure iterable of string(s)
            dstp = [pathlib.Path(dst).name]
        elif isinstance(dst, (tuple, list, set)):
            # convert each string to Path
            dstp = [pathlib.Path(d) for d in dst]
        else:
            # TODO directly support iterable of (src,dst) tuples
            sys.stderr.write("Please fix me, I cannot handle the destination(s) '{}'\n".format(dst) )
            return 1

    if not dstp:
        sys.stderr.write("No destination names given.\n")
        return 1

    # combine src and dst paths
    sdp = zip(srcp, dstp) # iterator of tuples

    # temporary directory
    with tempfile.TemporaryDirectory() as tempdir:
        tempdirp = pathlib.Path(tempdir)

        # extract original archive to temporary directory
        with tarfile.open(tp) as r:
            r.extractall(tempdirp)

        # copy source(s) to target in temporary directory, optionally replacing it
        for s,d in sdp:
            dp = tempdirp/d

            # TODO extend to allow flag individually
            # is_file must be called; the bare method object is always truthy
            if not dp.is_file() or replace:
                sys.stderr.write("Writing '{1}' (from '{0}')\n".format(s,d) )
                dp.write_bytes( s.read_bytes() )
            else:
                sys.stderr.write("Skipping '{1}' (from '{0}')\n".format(s,d) )

        # replace original archive with new archive from all files in tempdir
        with tarfile.open(tp, "w:gz") as w:
            for f in tempdirp.iterdir():
                w.add(f, arcname=f.name)

    return None

And a few "tests" as examples:

# targz_add("test.tar.gz", "new.png", "a.png")
# targz_add("test.tar.gz", "new.png", "a.png", replace=True)
# targz_add("test.tar.gz", ["new.png"], "a.png")
# targz_add("test.tar.gz", "new.png", ["a.png"], replace=True)
targz_add("test.tar.gz", "new.png", lambda x:str(x).replace("new","a"), replace=True)

shutil also supports archives, but not adding files to one:

https://docs.python.org/3/library/shutil.html#archiving-operations

New in version 3.2.
Changed in version 3.5: Added support for the xztar format.
High-level utilities to create and read compressed and archived files are also provided. They rely on the zipfile and tarfile modules.
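Although shutil cannot add a file to an existing archive directly, its archiving functions can cover the unpack and repack steps. The sketch below assumes that approach; targz_add_with_shutil is a hypothetical name. Because make_archive packs relative to root_dir, member names in the repacked archive typically come out prefixed with "./".

```python
import os
import shutil
import tempfile


def targz_add_with_shutil(targz_path, src_path, arcname):
    """Unpack, copy src_path in as arcname, and repack using shutil only.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as workdir:
        content = os.path.join(workdir, "content")
        os.mkdir(content)
        # unpack the existing archive into the work directory
        shutil.unpack_archive(targz_path, content, format="gztar")
        # add the new file, overwriting any extracted file of the same name
        shutil.copyfile(src_path, os.path.join(content, arcname))
        # make_archive appends ".tar.gz" to the base name itself
        packed = shutil.make_archive(
            os.path.join(workdir, "repacked"), "gztar", root_dir=content
        )
        # replace the original archive with the repacked one
        shutil.move(packed, targz_path)
```

For the example above: targz_add_with_shutil("test.tar.gz", "new.png", "a.png").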


Here is a version that decompresses the archive into memory using io.BytesIO, appends the file, and compresses the result again:

import io
import gzip
import tarfile

gfn = "test.tar.gz"
replace = "a.png"
replacement = "new.png"

print("reading {}".format(gfn))
m = io.BytesIO()
with gzip.open(gfn) as g:
    m.write(g.read())

print("opening tar in memory")
m.seek(0)
with tarfile.open(fileobj=m, mode="a") as t:
    t.list()
    print("adding {} as {}".format(replacement, replace))
    t.add(replacement, arcname=replace)
    t.list()

print("writing {}".format(gfn))
m.seek(0)
with gzip.open(gfn, "wb") as g:
    g.write(m.read())

It prints:

reading test.tar.gz
opening tar in memory
?rw-rw-rw- 0/0        877 2018-04-11 07:38:57 a.png 
?rw-rw-rw- 0/0        827 2018-04-11 07:38:57 b.png 
?rw-rw-rw- 0/0        787 2018-04-11 07:38:57 c.png 
adding new.png as a.png
?rw-rw-rw- 0/0        877 2018-04-11 07:38:57 a.png 
?rw-rw-rw- 0/0        827 2018-04-11 07:38:57 b.png 
?rw-rw-rw- 0/0        787 2018-04-11 07:38:57 c.png 
-rw-rw-rw- 0/0       2108 2018-04-11 07:38:57 a.png 
writing test.tar.gz

Optimizations are welcome!
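The second listing in the output shows two entries named a.png: appending does not remove the old member, it only adds a newer one after it. The following self-contained sketch illustrates this with a small in-memory archive; the helper add_bytes and the byte payloads are made up for the demonstration. According to the tarfile documentation, getmember treats the last occurrence of a name as the most up-to-date version.

```python
import io
import tarfile


def add_bytes(tar, name, payload):
    """Append an in-memory payload to an open TarFile (hypothetical helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# build a small uncompressed tar in memory
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "a.png", b"old")
    add_bytes(tar, "b.png", b"bee")

# append a second member with an already used name
buf.seek(0)
with tarfile.open(fileobj=buf, mode="a") as tar:
    add_bytes(tar, "a.png", b"new")  # the old entry stays in the archive

# read back: both entries are listed, the lookup by name finds the newer one
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    current = tar.extractfile(tar.getmember("a.png")).read()

print(names)
print(current)
```

The archive therefore grows with every append, which is one reason to rewrite it when the old copy should really be gone.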

handle answered Oct 23 '22 03:10