Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Among the many Python file copy functions, which ones are safe if the copy is interrupted?

As seen in How do I copy a file in Python?, there are many file copy functions:

  • shutil.copy

  • shutil.copy2

  • shutil.copyfile (and also shutil.copyfileobj)

  • or even a naive method:

    with open('sourcefile', 'rb') as f, open('destfile', 'wb') as g:
        while True:
            block = f.read(16*1024*1024)  # work by blocks of 16 MB
            if not block:  # EOF
                break
            g.write(block)
    

Among all these methods, which ones are safe in the case of a copy interruption (example: kill the Python process)? The last one in the list looks ok.

By safe I mean: if a 1 GB file copy is not 100% finished (let's say it's interrupted in the middle of the copy, after 400MB), the file size should not be reported as 1GB in the filesystem, it should:

  • either report the size the file had when the last bytes were written (e.g. 400MB)
  • or be deleted

The worst would be that the final filesize is written first (internally with an fallocate or ftruncate?). This would be a problem if the copy is interrupted: by looking at the file-size, we would think the file is correctly written.

Many incremental backup programs (I'm coding one) use "filename+mtime+fsize" to check if a file has to be copied or if it's already there (of course a better solution is to SHA256 source and destination files but this is not done for every sync, too much time-consuming; off-topic here).

So I want to make sure that the "copy file" function does not store the final file size immediately (then it could fool the fsize comparison), before copying the actual file content.


Note: I'm asking the question because, while shutil.filecopy was rather straighforward on Python 3.7 and below, see source (which is more or less the naive method above), it seems much more complicated on Python 3.9, see source, with many different cases for Windows, Linux, MacOS, "fastcopy" tricks, etc.

like image 441
Basj Avatar asked Dec 10 '20 11:12

Basj


People also ask

Which is very useful for copying the multiple source files into one destination file by making use of less code in Python?

The copy2() method in Python is used to copy the content of the source file to the destination file or directory. This method is identical to shutil. copy() method also preserving the file's metadata.

What's the difference between Shutil copy and Shutil copy?

The shutil.copy2() method is identical to shutil.copy() except that copy2() attempts to preserve file metadata as well.

What does CP do in Python?

run cp command to make a copy of a file or change a file name in Python.


1 Answers

Assuming that destfile does not exist prior to the copy, the naive method is safe, per your definition of safe.

shutil.copyfileobj() and shutil.copyfile() are close second in the ranking.

shutils.copy() is next, and shutils.copy2() would be last.

Explanation:

It is a filesystem's job to guarantee consistency based on application requests. If you are only writing X bytes to a file, the file size will only account for these X bytes.

Therefore, doing direct FS operations like the naive method will work.

It is now a matter of what these higher-level functions do with the filesystem.

The API doesn't state what happens if python crashes mid-copy, but it is a de-facto expectation from everyone that these functions behave like Unix cp, i.e. don't mess with the file size.

Assuming that the maintainers of CPython don't want to break people's expectations, all these functions should be safe per your definition.

That said, it isn't guaranteed anywhere, AFAICT.

However, shutil.copyfileobj() and shutil.copyfile() expressly have their API promise to not copy metadata, so they're not likely to try and set the size.

shutils.copy() wouldn't try to set the file size, only the mode, and in most filesystems setting the size and the mode require two different filesystem operations, so it should still be safe.

shutils.copy2() says it will copy metadata, and if you look at its source code, you'll see that it only copies the metadata after copying the data, so even that should be safe. Even more, copying the metadata doesn't copy the size.

So this would only be a problem if some of the internal functions python uses try to optimize using ftruncate(), fallocate(), or some such, which is unlikely given that people who write system APIs (like the python maintainers) are very aware of people's expectations.

like image 189
root Avatar answered Oct 16 '22 09:10

root