How to elegantly compare zip folder contents to unzipped folder contents

This is the scenario. I want to be able to back up the contents of a folder using a Python script. However, I want my backups to be stored in a compressed format, such as zip or possibly bz2.

The problem comes from the fact that I don’t want to bother backing up the folder if the contents in the “current” folder are exactly the same as what is in my most recent backup.

My process will be like this:

  1. Initiate backup
  2. Check contents of “current” folder against what is stored in the most recent zipped backup
  3. If same – then “complete”
  4. If different, then run backup, then “complete”

Can anyone recommend the most reliable and simple way of completing step 2? Do I have to unzip the contents of the backup into a temp directory to do the comparison, or is there a more elegant way of doing this? Possibly something to do with the modified date?

Jimmy asked Nov 19 '12 09:11


4 Answers

Zip files contain CRC32 checksums, and you can read them with the Python zipfile module: http://docs.python.org/2/library/zipfile.html. ZipFile.infolist() returns a list of ZipInfo objects, each with a CRC attribute. The ZipInfo objects also carry modification dates.

You can compare the zip checksum with calculated checksums for the unpacked files. You need to read the unpacked files but you avoid having to decompress everything.

CRC32 is not a cryptographic checksum but it should be enough if all you need is to check for changes.

This holds for zip files. Other archive formats (like tar.bz2) might not contain such easily-accessible metadata.
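
As a rough illustration of the CRC comparison described above, here is a minimal Python 3 sketch. The helper names file_crc32 and folder_matches_zip are hypothetical, and the sketch assumes the archive stores each file under its path relative to the backed-up folder; adjust the name mapping if your backups are laid out differently.

```python
import os
import zipfile
import zlib


def file_crc32(path, chunk_size=65536):
    """Compute the CRC32 of a file without loading it all into memory."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def folder_matches_zip(folder, zip_path):
    """Return True if the files under `folder` match the zip's file entries.

    Assumption: member names in the archive are paths relative to `folder`.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Map member name -> CRC stored in the archive, skipping directories.
        archived = {
            info.filename: info.CRC
            for info in zf.infolist()
            if not info.is_dir()
        }

    current = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            # Zip member names always use forward slashes.
            rel = os.path.relpath(full, folder).replace(os.sep, "/")
            current[rel] = file_crc32(full)

    # Equal dicts mean same set of files and same checksum for each one.
    return current == archived
```

Only the stored metadata is read from the archive, so nothing is decompressed; the cost is one read pass over the files in the current folder. Empty directories are ignored by this sketch.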

cdleonard answered Oct 06 '22 04:10


Rsync will automatically detect and only copy modified files, but seeing as you want to bzip the results, you still need to detect if anything has changed.

How about writing the directory listing (including timestamps) to a text file alongside your archive? On the next run, diff the current directory listing against this stored text. You can then grep the differences out and pipe that file list to rsync so that it includes only the changed files.
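
The stored-listing idea can also be sketched in Python rather than with diff and grep. This is a swapped-in variant, not the shell pipeline described above: it keeps a JSON manifest of size and modification time per file. The function names and the JSON format are assumptions made for illustration, and the manifest file is assumed to live outside the folder being backed up.

```python
import json
import os


def build_manifest(folder):
    """Map each relative file path to [size in bytes, mtime in nanoseconds]."""
    manifest = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            full = os.path.join(root, name)
            st = os.stat(full)
            rel = os.path.relpath(full, folder).replace(os.sep, "/")
            manifest[rel] = [st.st_size, st.st_mtime_ns]
    return manifest


def changed_files(folder, manifest_path):
    """Return the sorted paths that were added, removed, or modified."""
    new = build_manifest(folder)
    try:
        with open(manifest_path) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}  # no previous backup: every file counts as changed
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))


def save_manifest(folder, manifest_path):
    """Store the current listing next to the archive after a backup."""
    with open(manifest_path, "w") as f:
        json.dump(build_manifest(folder), f)
```

An empty list from changed_files means the backup can be skipped; otherwise run the backup and then call save_manifest. Because this relies on size and timestamps only, a file rewritten with identical size and an unchanged mtime would go unnoticed.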

invert answered Oct 06 '22 06:10


You could also try the following process:

1) Initiate backup

2) Run backup

3) Compare both compressed files:

import filecmp
# shallow=False forces a byte-by-byte comparison instead of trusting os.stat() data
same = filecmp.cmp(Compressed_new_file, Compressed_old_file, shallow=False)

4) If same – delete new backup file then "complete"

5) Else “complete”

NOTE: If you only need to check the modification times, you can have a look at this documentation.

Rather than decompressing the folder and comparing individual files, I think it might be easier to compare the compressed files. Overall I feel (OK, it's just an intuition :D) this will work better when there is a high probability that the contents of the folder change between runs of the script.

Pulimon answered Oct 06 '22 06:10


I use this script to create a compressed backup of a directory only when the directory's contents have changed since the last backup.

I store the MD5 digest of the uncompressed tar data in an external .md5 file, and I check it to detect directory changes.

import bz2
import hashlib
import io
import os
import tarfile

def backup_dir(dirname, backup_path):
    # Build an uncompressed tar of the directory in memory.
    fobj = io.BytesIO()
    with tarfile.open(mode='w', fileobj=fobj) as t:
        t.add(dirname)
    buf = fobj.getvalue()
    new_md5 = hashlib.md5(buf).digest()

    # Load the digest saved by the previous backup, if there is one.
    md5_path = backup_path + '.md5'
    if os.path.isfile(md5_path):
        with open(md5_path, 'rb') as f:
            old_md5 = f.read()
    else:
        old_md5 = b''

    # Tar headers include mtimes, so a touched file also triggers a backup.
    if new_md5 != old_md5:
        with open(backup_path, 'wb') as f:
            f.write(bz2.compress(buf))
        with open(md5_path, 'wb') as f:
            f.write(new_md5)
        print('backup done!')
    else:
        print('nothing to do')

gieffe answered Oct 06 '22 04:10