I want to append a file to a tar file. For example, test.tar.gz contains a.png, b.png, and c.png. I have a new PNG file, also named a.png, and I want to add it to test.tar.gz so that it replaces the old a.png in the archive. My code:
import tarfile
a = tarfile.open('test.tar.gz', 'w:gz')
a.add('a.png')
a.close()
After this, every file in test.tar.gz has disappeared except a.png. If I change my code to this:
import tarfile
a = tarfile.open('test.tar.gz', 'a:')# or a:gz
a.add('a.png')
a.close()
the program crashes with this error log:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/tarfile.py", line 1678, in open
return func(name, filemode, fileobj, **kwargs)
File "/usr/lib/python2.7/tarfile.py", line 1705, in taropen
return cls(name, mode, fileobj, **kwargs)
File "/usr/lib/python2.7/tarfile.py", line 1588, in __init__
raise ReadError(str(e))
tarfile.ReadError: invalid header
What are my mistakes and what should I do?
Update. From the documentation, it follows that gz files cannot be opened in a mode. If so, what is the best way to add or update files in an existing archive?
The simplest way to add a file to an already existing archive is the ' --append ' (' -r ') operation, which writes specified files into the archive whether or not they are already among the archived files. When you use ' --append ', you must specify file name arguments, as there is no default.
The tar (tape archive) format is produced by tar, a UNIX-based utility that packages files together for backup or distribution. A tar file (also known as a tarball) stores multiple files in uncompressed form along with metadata about the archive. Tar files themselves are not compressed.
The tar utility was originally introduced for the UNIX operating system. Its purpose is to collect multiple files into a single archive file, often called a tarball, which makes the files easy to distribute.
From the tarfile documentation:
Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.
So I guess you should decompress it using the gzip library, add the files using the a: mode in tarfile, and then compress it again using gzip.
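The three steps suggested above (decompress, append, recompress) can be sketched with temporary files as follows. This is only a minimal illustration, assuming Python 3; the helper name append_to_targz and its parameters are made up for this example and are not part of tarfile or gzip.

```python
import gzip
import os
import shutil
import tarfile
import tempfile


def append_to_targz(archive, src, arcname=None):
    """Append src to a .tar.gz by decompressing, appending, recompressing.

    Hypothetical helper for illustration only.
    """
    fd, tmp_tar = tempfile.mkstemp(suffix=".tar")
    os.close(fd)
    try:
        # 1. Strip the gzip layer: write the plain tar to a temporary file.
        with gzip.open(archive, "rb") as gz, open(tmp_tar, "wb") as out:
            shutil.copyfileobj(gz, out)
        # 2. An uncompressed tar file can be opened in append mode.
        with tarfile.open(tmp_tar, "a") as tar:
            tar.add(src, arcname=arcname or os.path.basename(src))
        # 3. Compress the result back over the original archive.
        with open(tmp_tar, "rb") as inp, gzip.open(archive, "wb") as gz:
            shutil.copyfileobj(inp, gz)
    finally:
        os.remove(tmp_tar)
```

Note that appending does not remove an existing member with the same name; the archive then simply contains both entries.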
David Dale asks:
Update. From the documentation, it follows that gz files cannot be opened in a mode. If so, what is the best way to add or update files in an existing archive?
Short answer:
I tried to do it in memory using the file/stream interfaces of gzip and tarfile, but did not manage to get it running. The tarball has to be rewritten anyway, since replacing a file in place is apparently not possible, so it's better to just unpack the whole archive.
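Since the tarball has to be rewritten in any case, one alternative to unpacking it is to stream the members from the old archive into a new one and skip the entry being replaced. The following is a rough sketch of that idea, not the script from this answer; replace_in_targz is a hypothetical name, and the code assumes Python 3.

```python
import os
import tarfile
import tempfile


def replace_in_targz(archive, src, arcname):
    """Rewrite archive, swapping the member arcname for the file src.

    Hypothetical helper for illustration only.
    """
    folder = os.path.dirname(os.path.abspath(archive))
    # Build the new archive next to the old one so os.replace stays
    # on the same filesystem.
    fd, tmp = tempfile.mkstemp(suffix=".tar.gz", dir=folder)
    os.close(fd)
    try:
        with tarfile.open(archive, "r:gz") as old, \
                tarfile.open(tmp, "w:gz") as new:
            for member in old:
                if member.name == arcname:
                    continue  # drop the outdated copy
                # Only regular files carry data to copy.
                data = old.extractfile(member) if member.isfile() else None
                new.addfile(member, data)
            new.add(src, arcname=arcname)
        os.replace(tmp, archive)  # swap in the rewritten archive
    except BaseException:
        os.remove(tmp)
        raise
```

For example, replace_in_targz("test.tar.gz", "new.png", "a.png") would leave b.png and c.png untouched and store the contents of new.png under the name a.png. The rewritten file gets the permissions of the temporary file, which may differ from those of the original archive.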
Wikipedia on tar, gzip.
The script, if run directly, also tries to generate the test images "a.png, b.png, c.png, new.png" (requiring Pillow) and the initial archive "test.tar.gz" if they don't exist. It then decompresses the archive into a temporary directory, overwrites "a.png" with the contents of "new.png", and packs all files, overwriting the original archive.
Of course the script's functions can also be run sequentially in interactive mode, in order to have a chance to look at the files. Assuming the script's filename is "t.py":
>>> from t import *
>>> make_images()
>>> make_archive()
>>> replace_file()
Workaround
Here we go (the essential part is in replace_file()):
#!python3
#coding=utf-8
"""
Replace a file in a .tar.gz archive via temporary files
https://stackoverflow.com/questions/28361665/how-to-append-a-file-to-a-tar-file-use-python-tarfile-module
"""
import sys #
import pathlib # https://docs.python.org/3/library/pathlib.html
import tempfile # https://docs.python.org/3/library/tempfile.html
import tarfile # https://docs.python.org/3/library/tarfile.html
#import gzip # https://docs.python.org/3/library/gzip.html

gfn = "test.tar.gz"
iext = ".png"
replace = "a"+iext
replacement = "new"+iext

def make_images():
    """Generate 4 test images with Pillow (PIL fork, http://pillow.readthedocs.io/)"""
    try:
        from PIL import Image, ImageDraw, ImageFont
        font = ImageFont.truetype("arial.ttf", 50)
        for k,v in {"a":"red", "b":"green", "c":"blue", "new":"orange"}.items():
            img = Image.new('RGB', (100, 100), color=v)
            d = ImageDraw.Draw(img)
            d.text((0, 0), k, fill=(0, 0, 0), font=font)
            img.save(k+iext)
    except Exception as e:
        print(e, file=sys.stderr)
        print("Could not create image files", file=sys.stderr)
        print("(pip install pillow)", file=sys.stderr)

def make_archive():
    """Create gzip compressed tar file with the three images"""
    try:
        t = tarfile.open(gfn, 'w:gz')
        for f in 'abc':
            t.add(f+iext)
        t.close()
    except Exception as e:
        print(e, file=sys.stderr)
        print("Could not create archive", file=sys.stderr)

def make_files():
    """Generate sample images and archive"""
    mi = False
    for f in ['a','b','c','new']:
        p = pathlib.Path(f+iext)
        if not p.is_file():
            mi = True
    if mi:
        make_images()
    if not pathlib.Path(gfn).is_file():
        make_archive()

def add_file_not():
    """Might even corrupt the existing file?"""
    print("Not possible: tarfile with \"a:gz\" - failing now:", file=sys.stderr)
    try:
        a = tarfile.open(gfn, 'a:gz') # not possible!
        a.add(replacement, arcname=replace)
        a.close()
    except Exception as e:
        print(e, file=sys.stderr)

def replace_file():
    """Extract archive to temporary directory, replace file, replace archive """
    print("Workaround", file=sys.stderr)
    # tempdir
    with tempfile.TemporaryDirectory() as td:
        # dirname to Path
        tdp = pathlib.Path(td)
        # extract archive to temporary directory
        with tarfile.open(gfn) as r:
            r.extractall(td)
        # print(list(tdp.iterdir()), file=sys.stderr)
        # replace target in temporary directory
        (tdp/replace).write_bytes( pathlib.Path(replacement).read_bytes() )
        # replace archive, from all files in tempdir
        with tarfile.open(gfn, "w:gz") as w:
            for f in tdp.iterdir():
                w.add(f, arcname=f.name)
    #done

def test():
    """as the name suggests, this just runs some tests ;-)"""
    make_files()
    #add_file_not()
    replace_file()

if __name__ == "__main__":
    test()
If you want to add files instead of replacing them, obviously just omit the line that replaces the temporary file, and copy the additional files into the temp directory. Make sure that pathlib.Path.iterdir then also "sees" the new files to be added to the new archive.
I've put this in a somewhat more useful function:
def targz_add(targz=None, src=None, dst=None, replace=False):
    """Add <src> file(s) to <targz> file, optionally replacing existing file(s).
    Uses temporary directory to modify archive contents.
    TODO: complete error handling...
    """
    import sys, pathlib, tempfile, tarfile
    # ensure targz exists
    tp = pathlib.Path(targz)
    if not tp.is_file():
        sys.stderr.write("Target '{}' does not exist!\n".format(tp) )
        return 1
    # src path(s)
    if not src:
        sys.stderr.write("No files given.\n")
        return 1
    # ensure iterable of string(s)
    if not isinstance(src, (tuple, list, set)):
        src = [src]
    # ensure path(s) exist
    srcp = []
    for s in src:
        sp = pathlib.Path(s)
        if not sp.is_file():
            sys.stderr.write("Source '{}' does not exist.\n".format(sp) )
        else:
            srcp.append(sp)
    if not srcp:
        sys.stderr.write("None of the files exist.\n")
        return 1
    # dst path(s) (filenames in archive)
    dstp = []
    if not dst:
        # default: use filename only
        dstp = [sp.name for sp in srcp]
    else:
        if callable(dst):
            # map dst to each Path, ensure results are Path
            dstp = [pathlib.Path(c) for c in map(dst, srcp)]
        elif not isinstance(dst, (tuple, list, set)):
            # single destination: use its filename only
            dstp = [pathlib.Path(dst).name]
        elif isinstance(dst, (tuple, list, set)):
            # convert each string to Path
            dstp = [pathlib.Path(d) for d in dst]
        else:
            # TODO directly support iterable of (src,dst) tuples
            sys.stderr.write("Please fix me, I cannot handle the destination(s) '{}'\n".format(dst) )
            return 1
    if not dstp:
        sys.stderr.write("No destination names could be determined.\n")
        return 1
    # combine src and dst paths
    sdp = zip(srcp, dstp) # iterator of tuples
    # temporary directory
    with tempfile.TemporaryDirectory() as tempdir:
        tempdirp = pathlib.Path(tempdir)
        # extract original archive to temporary directory
        with tarfile.open(tp) as r:
            r.extractall(tempdirp)
        # copy source(s) to target in temporary directory, optionally replacing it
        for s,d in sdp:
            dp = tempdirp/d
            # TODO extend to allow flag individually
            # note: is_file must be called; the bare method object is always truthy
            if not dp.is_file() or replace:
                sys.stderr.write("Writing '{1}' (from '{0}')\n".format(s,d) )
                dp.write_bytes( s.read_bytes() )
            else:
                sys.stderr.write("Skipping '{1}' (from '{0}')\n".format(s,d) )
        # replace original archive with new archive from all files in tempdir
        with tarfile.open(tp, "w:gz") as w:
            for f in tempdirp.iterdir():
                w.add(f, arcname=f.name)
    return None
And a few "tests" as examples:
# targz_add("test.tar.gz", "new.png", "a.png")
# targz_add("test.tar.gz", "new.png", "a.png", replace=True)
# targz_add("test.tar.gz", ["new.png"], "a.png")
# targz_add("test.tar.gz", "new.png", ["a.png"], replace=True)
targz_add("test.tar.gz", "new.png", lambda x:str(x).replace("new","a"), replace=True)
shutil also supports archives, but not adding files to one:
https://docs.python.org/3/library/shutil.html#archiving-operations
New in version 3.2.
Changed in version 3.5: Added support for the xztar format.
High-level utilities to create and read compressed and archived files are also provided. They rely on the zipfile and tarfile modules.
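Because shutil can only unpack and create whole archives, adding a file with it still means a full round trip through a temporary directory. A minimal sketch of that round trip is shown here, assuming Python 3 and an archive name ending in ".tar.gz"; targz_add_with_shutil is a hypothetical helper name.

```python
import shutil
import tempfile
from pathlib import Path


def targz_add_with_shutil(archive, src, arcname):
    """Unpack archive, copy src in as arcname, and repack (illustration only)."""
    archive = Path(archive).resolve()
    suffix = ".tar.gz"
    if not archive.name.endswith(suffix):
        raise ValueError("expected a .tar.gz archive")
    with tempfile.TemporaryDirectory() as tmp:
        # Unpack everything into the temporary directory.
        shutil.unpack_archive(str(archive), tmp, format="gztar")
        # Add (or overwrite) the file inside the unpacked tree.
        shutil.copyfile(src, Path(tmp) / arcname)
        # make_archive appends ".tar.gz" itself, so pass the name without it.
        base = str(archive)[: -len(suffix)]
        shutil.make_archive(base, "gztar", root_dir=tmp)
```

Member names in the rebuilt archive may carry a leading "./", because make_archive packs the contents relative to root_dir.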
Here's adding a file by extracting to memory using io.BytesIO, adding, and compressing:
import io
import gzip
import tarfile

gfn = "test.tar.gz"
replace = "a.png"
replacement = "new.png"

print("reading {}".format(gfn))
m = io.BytesIO()
with gzip.open(gfn) as g:
    m.write(g.read())

print("opening tar in memory")
m.seek(0)
with tarfile.open(fileobj=m, mode="a") as t:
    t.list()
    print("adding {} as {}".format(replacement, replace))
    t.add(replacement, arcname=replace)
    t.list()

print("writing {}".format(gfn))
m.seek(0)
with gzip.open(gfn, "wb") as g:
    g.write(m.read())
It prints:
reading test.tar.gz
opening tar in memory
?rw-rw-rw- 0/0 877 2018-04-11 07:38:57 a.png
?rw-rw-rw- 0/0 827 2018-04-11 07:38:57 b.png
?rw-rw-rw- 0/0 787 2018-04-11 07:38:57 c.png
adding new.png as a.png
?rw-rw-rw- 0/0 877 2018-04-11 07:38:57 a.png
?rw-rw-rw- 0/0 827 2018-04-11 07:38:57 b.png
?rw-rw-rw- 0/0 787 2018-04-11 07:38:57 c.png
-rw-rw-rw- 0/0 2108 2018-04-11 07:38:57 a.png
writing test.tar.gz
Optimizations are welcome!
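As the listing shows, appending does not overwrite the old a.png; the archive ends up with two members of that name. According to the tarfile documentation, getmember() assumes the last occurrence of a duplicated member to be the most up-to-date version. The following self-contained sketch illustrates this with small in-memory payloads instead of real PNG files; add_bytes is a hypothetical helper.

```python
import io
import tarfile


def add_bytes(tar, name, payload):
    """Add an in-memory payload to an open TarFile under the given name."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Build a small uncompressed tar archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "a.png", b"old")
    add_bytes(tar, "b.png", b"other")

# Append a second member that reuses the name "a.png".
buf.seek(0)
with tarfile.open(fileobj=buf, mode="a") as tar:
    add_bytes(tar, "a.png", b"new")

# Both entries are stored; getmember() picks the last one.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
    latest = tar.extractfile(tar.getmember("a.png")).read()

print(names)
print(latest)
```

The old data still occupies space in the archive, so repeated appends make the file grow; rewriting the archive is the only way to drop the stale copy.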