I have a script which I'm using to copy files from one location to another; the files beneath the directory structure are all .txt files.
The script just evaluates the file size on the source and only copies a file if its size is not zero bytes. However, I need to run this script in a cron job at certain intervals to copy any incremented data.
So I need to know how to copy only the content that was updated in the source file, and then update the destination with the new content only, rather than overwriting what is already present at the destination.
Code:
#!/bin/python3
import os
import glob
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    # The result of the below glob _is_ a full path
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
                shutil.copy(filename, "/data2/logs/")

if __name__ == '__main__':
    Copy_Logs()
I'm looking for a way to use shutil the way rsync works, or for an alternative to the code I have. In a nutshell, I need to copy each file only once if it hasn't already been copied, and then copy only the delta if the source gets updated.
Note: The Info_month = datetime.datetime.now().strftime("%B") is mandatory to keep, as this determines the current directory by month name.
Edit:
Just having another raw idea: perhaps we can use filecmp with shutil.copyfile to compare files and directories, but I'm not getting how to fit that into the code.
import os
import glob
import filecmp
import shutil
import datetime

def Copy_Logs():
    Info_month = datetime.datetime.now().strftime("%B")
    for filename in glob.glob("/data1/logs/{0}/*/*.txt".format(Info_month)):
        if os.path.getsize(filename) > 0:
            dest = "/data2/logs/" + os.path.basename(filename)
            # Copy if the file is missing at the destination or if the
            # source and destination contents differ (filecmp.cmp
            # compares the two files; copyfile needs a full file path).
            if not os.path.exists(dest) or not filecmp.cmp(filename, dest):
                shutil.copyfile(filename, dest)

if __name__ == '__main__':
    Copy_Logs()
You could use Google's Diff Match Patch (you can install it with pip install diff-match-patch) to create a diff and apply a patch from it:
import diff_match_patch as dmp_module

#...
        if not os.path.exists("/data2/logs/" + os.path.basename(filename)):
            shutil.copy(filename, "/data2/logs/")
        else:
            with open(filename) as src, open("/data2/logs/" + os.path.basename(filename),
                                             'r+') as dst:
                dmp = dmp_module.diff_match_patch()
                src_text = src.read()
                dst_text = dst.read()
                diff = dmp.diff_main(dst_text, src_text)
                if len(diff) == 1 and diff[0][0] == 0:
                    # No changes
                    continue
                # make patch
                patch = dmp.patch_make(dst_text, diff)
                # apply it
                result = dmp.patch_apply(patch, dst_text)
                # write
                dst.seek(0)
                dst.write(result[0])
                dst.truncate()
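Note that patch_apply returns a two-element list: the patched text and a list of booleans indicating whether each patch hunk applied cleanly. That is why result[0] is what gets written back to the file.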
As mentioned before, rsync is a better way to do this kind of job where you need to carry out an incremental file transfer, i.e. copy only the delta of the data, so I would rather do it with rsync and the subprocess module all along.
However, you can also assign a variable Curr_date_month to get the current date, month and year, as your requirement is to copy the files only from the current month and day folder. You can also define source and destination variables just for the ease of writing them into the code.
Secondly, though you already have a check for the file size with getsize, I would add the rsync option --min-size=1 to make sure zero-byte files are not copied.
The final code would look like this:
#!/bin/python3
import os
import glob
import datetime
import subprocess

def Copy_Logs():
    # Variable declaration to get the month and Curr_date_month
    Info_month = datetime.datetime.now().strftime("%B")
    Curr_date_month = datetime.datetime.now().strftime("%b_%d_%y")
    Sourcedir = "/data1/logs"
    Destdir = "/data2/logs/"
    ###### End of your variable section #######################
    # The result of the below glob _is_ a full path
    for filename in glob.glob("{2}/{0}/{1}/*.txt".format(Info_month, Curr_date_month, Sourcedir)):
        if os.path.getsize(filename) > 0:
            # No need to check for existence at the destination first:
            # rsync itself skips files that are already up to date and
            # transfers only the changed portions of updated files.
            subprocess.call(['rsync', '-avz', '--min-size=1', filename, Destdir])

if __name__ == '__main__':
    Copy_Logs()
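As a side note, on Python 3.5+ you may prefer subprocess.run over subprocess.call, since it can raise an error when rsync fails instead of silently ignoring a non-zero exit code. A minimal sketch with the same arguments as above (the sync_file helper is just an illustrative wrapper):

import subprocess

def sync_file(filename, destdir):
    # check=True raises CalledProcessError if rsync exits non-zero,
    # so a failed transfer is not silently ignored.
    subprocess.run(['rsync', '-avz', '--min-size=1', filename, destdir],
                   check=True)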
One way is to save a single line to a file to keep track of the latest time you copied the files (with the help of os.path.getctime), and to maintain that line each time you copy.
Note: The following snippet can be optimized.
import datetime
import glob
import os
import shutil

Info_month = datetime.datetime.now().strftime("%B")
list_of_files = sorted(glob.iglob("/data1/logs/{0}/*/*.txt".format(Info_month)),
                       key=os.path.getctime, reverse=True)

if not os.path.exists("track_modifications.txt"):
    # First run: copy everything and record the newest timestamp.
    latest_file_modified_time = os.path.getctime(list_of_files[0])
    for filename in list_of_files:
        shutil.copy(filename, "/data2/logs/")
    with open('track_modifications.txt', 'w') as the_file:
        the_file.write(str(latest_file_modified_time))
else:
    with open('track_modifications.txt', 'r') as the_file:
        latest_file_modified_time = the_file.readline()
    should_copy_files = [filename for filename in list_of_files if
                         os.path.getctime(filename) > float(latest_file_modified_time)]
    for filename in should_copy_files:
        shutil.copy(filename, "/data2/logs/")
    # Record the newest timestamp so the next run starts from here
    # (the list is sorted newest-first, so index 0 is the newest file).
    if should_copy_files:
        with open('track_modifications.txt', 'w') as the_file:
            the_file.write(str(os.path.getctime(should_copy_files[0])))
The approach is to create a file that contains the timestamp of the latest file that was modified on the system.
All the files are retrieved and sorted by their modification time:
list_of_files = sorted(glob.iglob('directory/*.txt'), key=os.path.getctime, reverse=True)
Initially, in if not os.path.exists("track_modifications.txt"): I check whether this file does not exist (i.e., it is the first time we copy). If so, I save the largest file timestamp in
latest_file_modified_time = os.path.getctime(list_of_files[0])
then I just copy all the files and write this timestamp to the track_modifications file.
Otherwise, the file exists (i.e., files were copied before), so I read that timestamp, compare it against the files in list_of_files, and retrieve all files with a larger timestamp (i.e., created after the last file I copied). That is done in
should_copy_files = [filename for filename in list_of_files if os.path.getctime(filename) > float(latest_file_modified_time)]
Actually, tracking the timestamp of the latest modified file also gives you the advantage of re-copying files that were already copied once they change :)
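One caveat: on Linux, os.path.getctime returns the inode change time, not the creation time, so renames or permission changes also bump it. If you specifically want the last write time, os.path.getmtime may be a closer fit. A minimal sketch of the same tracking idea using mtime (the helper name and the track-file name are just illustrative):

import glob
import os

def files_newer_than(pattern, track_file='track_modifications.txt'):
    # Read the last recorded mtime, defaulting to 0 on the first run.
    try:
        with open(track_file) as fh:
            last_mtime = float(fh.readline())
    except FileNotFoundError:
        last_mtime = 0.0
    # Keep only files written after the recorded timestamp.
    return [f for f in glob.glob(pattern)
            if os.path.getmtime(f) > last_mtime]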
There are some very interesting ideas in this thread, but I will try to propose some new ones.
Per your question, it's clear that you are using a cron job to keep track of updated files.
If you are trying to monitor a relatively small number of files/directories, I would propose a different approach that will simplify your life.
You can use the Linux inotify mechanism, which allows you to monitor specific files/directories and get notified whenever a file is written to.
Pro: you know about every single write immediately, without needing to poll for changes. You can of course write a handler that doesn't update the destination on every write, but only once every X minutes.
Here is an example that uses the inotify python package (taken from the package's page):
import inotify.adapters

def _main():
    i = inotify.adapters.Inotify()
    i.add_watch('/tmp')
    with open('/tmp/test_file', 'w'):
        pass
    for event in i.event_gen(yield_nones=False):
        (_, type_names, path, filename) = event
        print("PATH=[{}] FILENAME=[{}] EVENT_TYPES={}".format(
              path, filename, type_names))

if __name__ == '__main__':
    _main()
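If you only want to sync once every X minutes rather than on every single write, you could batch the events and flush them periodically. A rough sketch, assuming the same inotify package; the watch_and_sync name, the interval value and the copy_file callback are illustrative placeholders, not part of the package:

import os
import time
import inotify.adapters

def watch_and_sync(watch_dir, copy_file, interval=300):
    i = inotify.adapters.Inotify()
    i.add_watch(watch_dir)
    pending = set()
    last_sync = time.monotonic()
    # With a timeout, event_gen yields None periodically, so the loop
    # can flush pending files even when no new events arrive.
    for event in i.event_gen(timeout_s=1):
        if event is not None:
            (_, type_names, path, filename) = event
            if 'IN_CLOSE_WRITE' in type_names:
                pending.add(os.path.join(path, filename))
        if pending and time.monotonic() - last_sync >= interval:
            for f in pending:
                copy_file(f)  # user-supplied copy/patch routine
            pending.clear()
            last_sync = time.monotonic()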
If you decide to use the inotify mechanism, it will be trivial to keep track of your state.
Then, there are two possibilities:
1. New contents are ALWAYS appended
If this is the case, you can simply copy everything from your last offset to the end of the file (see the sketch after this list).
2. New contents are written at random locations
In this case, I would recommend a method proposed by other answers as well: using diff patches. This is by far the most elegant solution in my opinion. Some options here are the diff-match-patch package shown in an earlier answer, or Python's built-in difflib module.
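Here is the sketch referenced in option 1: copying only the bytes appended since the last run, persisting the offset between runs (the copy_appended name and the offset-file convention are just one possible way to do it):

import os

def copy_appended(src, dst, offset_file):
    # Load the offset where the previous run stopped (0 on the first run).
    try:
        with open(offset_file) as fh:
            offset = int(fh.read() or 0)
    except FileNotFoundError:
        offset = 0
    with open(src, 'rb') as s, open(dst, 'ab') as d:
        s.seek(offset)        # skip the bytes copied last time
        d.write(s.read())     # append only the new bytes
        offset = s.tell()     # remember where this run stopped
    with open(offset_file, 'w') as fh:
        fh.write(str(offset))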