
How to copy only the changed file contents to an already existing destination file?

I have a script that I use to copy files from one location to another. The files beneath the directory structure are all .txt files.

The script just checks the file size at the source and copies a file only if it is not zero bytes. However, I need to run this script from cron at regular intervals to copy any newly added data.

So I need to know how to copy only the content that has been updated in the source file, and update the destination with just the new content rather than overwriting the file if it is already present at the destination.

Code:

#!/bin/python3
import os
import glob
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    # The result of the below glob _is_ a full path
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
                shutil.copy(filename, "/data2/logs/")

if __name__ == '__main__':
    Copy_Logs()

I'm looking for a way to use shutil the way rsync works, or for an alternative to the code I have.

In a nutshell, I need to copy a file once if it has not already been copied, and after that copy only the delta when the source gets updated.

Note: Info_month = datetime.datetime.now().strftime("%B") must be kept, as it determines the current directory by month name.

Edit:

Another rough idea: maybe filecmp can be used together with shutil.copyfile to compare the files, but I can't work out how to fit that into the code.

import os
import glob
import filecmp
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            if not os.path.exists("/data2/logs/" + os.path.basename(filename)) or not filecmp.cmp("/data2/logs/" + os.path.basename(filename), "/data2/logs/"):
                shutil.copyfile(filename, "/data2/logs/")

if __name__ == '__main__':
    Copy_Logs()
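
A minimal sketch of how filecmp might fit into the script, assuming the comparison should be between the source file and its copy under the destination directory. The helper name copy_if_changed is hypothetical. Note that shutil.copyfile expects a full destination file path, not a directory, and that this still rewrites the whole file when it differs; it does not transfer only the delta.

```python
import filecmp
import os
import shutil


def copy_if_changed(src, dest_dir):
    """Copy src into dest_dir unless an identical copy is already there.

    Returns True when a copy was made, False when the file was skipped.
    """
    dest = os.path.join(dest_dir, os.path.basename(src))
    if os.path.getsize(src) == 0:
        return False  # skip zero-byte files, as the original script does
    # shallow=False compares the file contents, not just size and mtime
    if os.path.exists(dest) and filecmp.cmp(src, dest, shallow=False):
        return False
    shutil.copyfile(src, dest)  # destination must be a file path
    return True
```

Inside the glob loop of Copy_Logs() this would be called as copy_if_changed(filename, "/data2/logs/").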

krock1516 asked Jan 15 '19 10:01


4 Answers

You could use Google's Diff Match Patch (install it with pip install diff-match-patch) to create a diff and apply it as a patch:

import diff_match_patch as dmp_module

#...
if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
    shutil.copy(filename, "/data2/logs/")
else:
    with open(filename) as src, open("/data2/logs/" + os.path.basename(filename),
                                                                        'r+') as dst:
        dmp = dmp_module.diff_match_patch()

        src_text = src.read()
        dst_text = dst.read()

        diff = dmp.diff_main(dst_text, src_text)

        if len(diff) == 1 and diff[0][0] == 0:
            # No changes
            continue

        #make patch
        patch = dmp.patch_make(dst_text, diff)
        #apply it
        result = dmp.patch_apply(patch, dst_text)

        #write
        dst.seek(0)
        dst.write(result[0])
        dst.truncate()

y.luis answered Oct 20 '22 16:10


As already mentioned, rsync is a better fit for this kind of job, where you need to transfer an incremental file list, that is, the delta of the data. So I would rather do it with rsync and the subprocess module.

You can also assign a variable Curr_date_month holding the current date, month and year, so that only the files from the current month and day folder are copied. You can also define source and destination variables to make the code easier to write.

Secondly, although you already check the file size with getsize, I would add the rsync option --min-size= to make sure zero-byte files are not copied.

Your final code goes here.

#!/bin/python3
import os
import glob
import datetime
import subprocess

def Copy_Logs():
    # Month name and date string used to locate the current day's folder
    Info_month = datetime.datetime.now().strftime("%B")
    Curr_date_month = datetime.datetime.now().strftime("%b_%d_%y")
    Sourcedir = "/data1/logs"
    Destdir = "/data2/logs/"
    ###### End of your variable section #######################
    # The result of the below glob _is_ a full path
    for filename in glob.glob("{2}/{0}/{1}/*.txt".format(Info_month, Curr_date_month, Sourcedir)):
        if os.path.getsize(filename) > 0:
            # No os.path.exists() check here: rsync skips files that are already
            # up to date, and it has to run on existing files to pick up updates.
            subprocess.call(['rsync', '-avz', '--min-size=1', filename, Destdir])

if __name__ == '__main__':
    Copy_Logs()

Karn Kumar answered Oct 20 '22 15:10


One way is to save a single line to a file that keeps track of the latest timestamp (obtained with os.path.getctime) of the files you copied, and to update that line each time you copy.

Note: The following snippet can be optimized.

import datetime
import glob
import os
import shutil

Info_month = datetime.datetime.now().strftime("%B")
# Newest first, so index 0 always carries the latest timestamp
list_of_files = sorted(glob.iglob("/data1/logs/{0}/*/*.txt".format(Info_month)), key=os.path.getctime, reverse=True)
if not os.path.exists("track_modifications.txt"):
    # First run: copy everything and record the newest timestamp
    latest_file_modified_time = os.path.getctime(list_of_files[0])
    for filename in list_of_files:
        shutil.copy(filename, "/data2/logs/")
    with open('track_modifications.txt', 'w') as the_file:
        the_file.write(str(latest_file_modified_time))
else:
    with open('track_modifications.txt', 'r') as the_file:
        latest_file_modified_time = the_file.readline()
    should_copy_files = [filename for filename in list_of_files if
                         os.path.getctime(filename) > float(latest_file_modified_time)]
    for filename in should_copy_files:
        shutil.copy(filename, "/data2/logs/")
    # Record the new latest timestamp; otherwise the same files
    # would be copied again on every later run
    if should_copy_files:
        with open('track_modifications.txt', 'w') as the_file:
            the_file.write(str(os.path.getctime(should_copy_files[0])))

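The approach is to create a file that contains the timestamp of the most recently changed file.

Retrieve all the files and sort them by change time, newest first. On Linux, os.path.getctime returns the inode change time, which is also updated when the content is modified; on Windows it returns the creation time.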

list_of_files = sorted(glob.iglob('directory/*.txt'), key=os.path.getctime, reverse=True)

Initially, in if not os.path.exists("track_modifications.txt"): I check whether this file does not exist (i.e., this is the first copy), then I save the largest file timestamp in

latest_file_modified_time = os.path.getctime(list_of_files[0])

Then I copy all the files and write this timestamp to the track_modifications file.

Otherwise the file exists (i.e., files were copied before), so I read that timestamp, compare it with the files in list_of_files, and retrieve all files with a larger timestamp (i.e., changed after the last file I copied). That is in

should_copy_files = [filename for filename in list_of_files if os.path.getctime(filename) > float(latest_file_modified_time)]

Tracking the timestamp of the latest changed file also has the advantage that files which were already copied are copied again when they change :)


ndrwnaguib answered Oct 20 '22 16:10


There are some very interesting ideas in this thread, but I will try to propose some new ideas.

Idea no. 1: Better way for tracking updates

Per your question, it's clear that you are using a cron job to keep track of the updated file.

If you are trying to monitor a relatively small number of files/directories, I would propose a different approach that will simplify your life.

You can use the Linux inotify mechanism, which allows you to monitor specific files/directories and get notified whenever a file is written to.

Pro: You learn of every single write immediately, without needing to check for changes. You can of course write a handler that doesn't update the destination on every write, but only once every X minutes.

Here is an example that uses the inotify python package (taken from the package's page):

import inotify.adapters

def _main():
    i = inotify.adapters.Inotify()

    i.add_watch('/tmp')

    with open('/tmp/test_file', 'w'):
        pass

    for event in i.event_gen(yield_nones=False):
        (_, type_names, path, filename) = event

        print("PATH=[{}] FILENAME=[{}] EVENT_TYPES={}".format(
              path, filename, type_names))

if __name__ == '__main__':
    _main()
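
The "once every X minutes" handler mentioned above could be sketched roughly as follows. ThrottledSync, sync_func and the injectable clock are hypothetical names used for illustration; they are not part of the inotify package. The sketch assumes that it is acceptable to batch the changed paths and hand them to a sync function at most once per interval.

```python
import time


class ThrottledSync:
    """Collect changed paths and pass them to sync_func at most once per interval."""

    def __init__(self, sync_func, interval_seconds, clock=time.monotonic):
        self._sync_func = sync_func      # hypothetical callback taking a list of paths
        self._interval = interval_seconds
        self._clock = clock              # injectable so the timing can be tested
        self._pending = set()
        self._last_flush = clock()

    def notify(self, path):
        """Record a write event and flush if the interval has elapsed."""
        self._pending.add(path)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        """Sync every pending path once and restart the interval."""
        if self._pending:
            self._sync_func(sorted(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()
```

In the event loop above, notify() would be called with os.path.join(path, filename) for each write event. Because flushing is only triggered by incoming events, a final flush() on shutdown (or from a timer) is needed so that the last batch is not left pending.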

Idea no. 2: Copying only the changes

If you decide to use the inotify mechanism, it will be trivial to keep track of your state.

Then, there are two possibilities:

1. New contents are ALWAYS appended

If this is the case, you can simply copy everything from your last offset to the end of the file.

2. New contents are written at random locations

In this case, I would recommend a method proposed by other answers as well: Using diff patches. This is by far the most elegant solution in my opinion.

Some options here are:

  • diff-match-patch
  • diff-and-patch
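
For the append-only case (possibility 1 above), a minimal sketch might look like this. It assumes the destination is only ever written by this function, so the destination's current size can serve as the offset already copied; append_new_bytes is a hypothetical helper name. If the source turns out to be smaller than the destination (for example after rotation or truncation), the sketch falls back to a full copy.

```python
import os


def append_new_bytes(src, dest, chunk_size=1024 * 1024):
    """Append to dest whatever src has gained since the last sync.

    Assumes src is append-only, so the size of dest is the offset
    that has already been copied. Returns the number of bytes written.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    src_size = os.path.getsize(src)
    if src_size == offset:
        return 0  # nothing new
    if src_size < offset:
        offset = 0  # source shrank: start over with a full copy
    copied = 0
    mode = "ab" if offset else "wb"
    with open(src, "rb") as fin, open(dest, mode) as fout:
        fin.seek(offset)
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
    return copied
```

With the paths from the question, this would be called as append_new_bytes(filename, os.path.join("/data2/logs/", os.path.basename(filename))) for each file found by the glob.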

Daniel Trugman answered Oct 20 '22 15:10